Overall: I directionally and conceptually agree with most of what is said in this post, and only highlight and comment on the things that I disagree about (or not fully agree, or find ontologically or conceptually somewhat off).
an agent minimizing its cross-entropy loss,
I understand this is not the point of your paper and is just an example, yet I want to use the opportunity to discuss it. The training loss is not the agent's surprise. Loss is more like a field force that helps the agent to stay in its nich...
Deutsch delves into this topic in great depth in "The Fabric of Reality". There are mathematical objects that "exist in the abstract", and proofs, whether done with pen and paper or in Mathematica, are not categorically different from computer simulations that provide evidence that such-and-such a physical system will behave in such-and-such a way, but don't "prove" it.
The idea that alignment research constitutes some kind of "search for a solution" within the discipline of cognitive science is wrong.
Let's stick with the Open Agency Architecture (OAA), not because I particularly endorse it (I have nothing to say about it actually), but because it suits my purposes.
We need to predict the characteristics that civilisational intelligence architectures like OAA will have, characteristics such as robustness of alignment with humans, robustness/resilience in a more general sense (for example, in the face of external shocks, su...
However, recent events have revealed that what we actually need is lots of minds pointed at random directions, and some of them will randomly get lucky and end up pointed in the right place at the right time. Some people will still have larger search spaces than others, but it's still vastly more egalitarian than what anyone expected ~4 years ago.
The idea that we need many different people to "poke at the alignment problem in random directions" implicitly presupposes that alignment (technical, at least) is a sort of mathematical problem that co...
No, just a piece of the puzzle of a more salient understanding of AI self-control that I want to outline, which should integrate ML, cognitive science, theory of consciousness, control theory/resilience theory, and dynamical systems theory/stability theory.
Only this sort of understanding could make the discussion of oracle AI vs. agent AI agendas really substantiated, IMO.
ACI learns to behave the same way as examples, so it can also learn ethics from examples. For example, if behaviors like “getting into a very cold environment” are excluded from all the examples, either by natural selection or artificial selection, an ACI agent can learn ethics like “always getting away from cold”, and use it in the future. If you want to achieve new ethics, you have to either induce from the old ones or learn from selection in something like “supervised stages”.
You didn't respond to the critical part of my comment: "However, after removing...
That LW post review interface is a thin veneer, just making pre-publication feedback a tiny bit more convenient for some people (unless they already use Google Docs, Notion, or other systems to draft their posts which offer exactly the same interface).
However, this is not a proper, scientific peer review system that fosters better research collaboration and accumulation of knowledge. I wrote about this recently here.
Overall I believe that this is a hard problem and probably others have already thought about it.
I'm not sure people have seriously thought about this before; your perspective seems rather novel.
I think existing labs themselves are the best vehicle to groom new senior researchers. Anthropic, Redwood Research, ARC, and probably other labs were all founded by ex-staff of existing labs at the time (except that maybe one shouldn't credit OpenAI for "grooming" Paul Christiano to senior level, but anyways).
It's unclear what field-building projects could incentivise labs ...
The system that I proposed is simpler: it doesn't have fine-grained and selective access, and therefore doesn't require continuous effort from some people for "connecting the dots". It's just a single space, basically like the internal Notion + Slack space + Google Drive of the AI safety lab that would lead this project. In this space, people can share research and ideas, and have "mildly" infohazardous discussions, such as regarding the pros and cons of different approaches to building AGI.
I cannot imagine that system would end up unused. At least three people (you, m...
I don't understand. Thinking doesn't happen successfully? Is it successful on LW though, and by what measure?
I think that in the ACI model, you correctly capture that agents are not bestowed with the notion of good and bad "from above", as in the AIXI model (implicitly, encoded as rewards). ACI appears to be an uncomputable, idealised version of predictive processing.
However, after removing ethics "from the outside", ACI is left without an adequate replacement. I. e., this is an agent devoid of ethics as a cognitive discipline, which appears to be intimately related to foresight. ACI lacks constructive foresight, too, it always "looks back", which warranted perio...
I have the same sentiment as you. I wrote about this here: Has private AGI research made independent safety research ineffective already? What should we do about this?
it did not appear to actually solve the problems any specific person I contacted was having.
I think it's important to realise (including for the people whom you spoke to) that we are not in the business of solving the specific problems that researchers have individually (or as a small research group), but a collective coordination problem, i. e., a problem with the design of the collective, civilisational project of developing non-catastrophic AI.
I wrote a post about this.
Thanks for pointing this out, I've fixed the post
I argue for the former in the section "Linguistic capability circuits inside LLM-based AI could be sufficient for approximating general intelligence". Insisting that AGI action must be a single Transformer inference is pointless: sure, The Bitter Lesson suggests that things will eventually converge in that direction, but the first AGI is unlikely to be like that.
Attention dilution, exactly. Ultimately, I want (because I think this will be more effective) all relevant work to be syndicated on LW/AF (via linkposts and review posts), not the other way around, with AI safety researchers having to subscribe to arxiv sanity, google AI blog, all relevant standalone blogs such as Bengio's and Scott Aaronson's, etc., all by themselves and separately.
I even think that LW hiring part-time staff dedicated to doing this would be very valuable.
Also, alignment newsletters, to further pre-process information, don't live. Shah tri...
I strongly agree with most of this.
Did you see LeCun's proposal about how to improve academic review here? It strikes me as very good and I'd love it if the AI safety/x-risk community had a system like this.
I'm suspicious about creating a separate journal, rather than concentrating efforts around existing institutions: LW/AF. I think it would be better to fund LW exactly for this purpose and add monetary incentives for providing good reviews of research writing on LW/AF (and, of course, the research writing itself could be incentivised in this way, too).
Then, tur...
I feel that LW is quite bad as the system for performing AI safety research. Most likely worse than the traditional system of academic publishing, in aggregate. A random list of things that I find unhelpful:
I suspect future language models will have beliefs in a more meaningful sense than current language models, but I don’t know in what sense exactly, and I don’t think this is necessarily essential for our purposes.
In Active Inference terms, the activations within current LLMs upon processing the context parameterise LLM's predictions of the observations (future text), Q(y|x), where x is internal world model states, and y are expected observations -- future tokens. So current LLMs do have beliefs.
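To make the Q(y|x) reading concrete, here is a minimal sketch, assuming a toy three-token vocabulary and made-up logits (neither comes from any real model, and `predictive_distribution` is a hypothetical helper). The idea it illustrates: the internal state x fixes one logit per candidate token, and a softmax turns those logits into the predictive distribution Q(y|x) over the next observation.

```python
import math

def predictive_distribution(logits):
    """Turn per-token logits into Q(y | x) with a numerically stable softmax."""
    m = max(logits.values())  # subtract the max so exp() cannot overflow
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits that some internal state x assigns to candidate next tokens.
toy_logits = {"cat": 2.0, "dog": 1.0, "car": -1.0}
q = predictive_distribution(toy_logits)
```

In this framing, the "belief" is the whole distribution `q`, not the single most likely token.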
Worry 2: Even if GPT-n develops “beliefs” in a meaningful
Please also take into account that TAI (~= AGI) is not the same thing as ASI. It's now perhaps within the Overton window for normie traders to imagine the arrival of AGI/TAI that will radically automate the economy and unlock abundance, but imagining that this stage will, perhaps very soon, be followed by ASI is still outside the Overton window. For example, in the public discourse, there is plenty of discussion of powerful AI taking away human jobs (more often than not with the connotation that it will simultaneously create many more jobs and th...
One actor having all the money and economic power in the world and all the rest "trashed" is not a coherent situation, under modern economics. For the products, businesses, services, etc. to be outcompeted you still need the trade happening, and this trade should necessarily be bidirectional.
If trade ceases it either means that the whole world is converted into a single corporation under the command of a single ASI (in which case the conventional market economics have ended, no more money and company valuations), or ASI just decouples from the rest of the ec...
The relevant paragraph that I quoted refutes exactly this. In the bolded sentence, "value function" is used as a synonym to "utility function". You simply cannot represent an agent that always seeks to maximise "empowerment" (as defined in the paper for Self-preserving agents), for example, or always seeks to minimise free energy (as in Active Inference agents), as maximising some quantity over its lifetime: if you integrate empowerment or free energy over time you don't get a sensible information quantity that you can label as "utility".
This is an uncontr...
Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning
I think that adding new types of systems and agents to the universe changes the optimal "applied ethics" in the situation (I wrote about this here, in the "PS.", the last paragraph), so the only hope is for the discriminator to be 1) a general intelligence; 2) using a scale-free, naturalistic theory of ethics as a theoretical discipline for evaluating any applied ethics theories in any situations and contexts.
Also, hopefully, the "least wrong" scale-free ethics is "aligned" with humans, in the sense that it "saves" us from oblivion. For example, a version of...
You are talking about a "verifier for explanations". I don't know how an explanation could be verified under constructivist epistemology and pragmatist meta-epistemology.
I've recently thought about the relationship between GFlowNets and constructivism. Here are some excerpts, unedited, but hopefully helpful in some way to someone.
It’s interesting that GFlowNets suggest constructing a trajectory, i. e., an explanation, rather than sampling it via Markov Chain Monte Carlo (MCMC) methods (e. g. Monte-Carlo Tree Search), as suggested in the current Act...
You are talking about what I would call a phenomenological, or "philosophical-in-the-hard-problem-sense" consciousness ("phenomenological" is also not quite the right word because psychology is also phenomenology, relative to neuroscience, but this is an aside).
"Psychological" consciousness (specifically, two kinds of it: affective/basal/core consciousness, and access consciousness) is not mysterious at all. These are just normal objects in neuropsychology.
Corresponding objects could also be found in AIs, and called "interpretable AI consciousness".
"Psycho...
I talked about psychologists-scientists, not psychologists-therapists. I think psychologists-scientists should have unusually good imaginations about the potential inner workings of other minds, which many ML engineers probably lack. I think it's in principle possible for psychologists-scientists to understand all the mech. interpretability papers in ML that are being published at the necessary level of detail. Developing imagination about the inner workings of other minds in ML engineers could be harder.
That being said, as de-facto the only scientifically gr...
Behavioural psychology of AI should be an empirical field of study. Methodologically, the progression is reversed:
Hello Igor, I have some thoughts about this, let's discuss on telegram? My account is t.me/leventov. Or, if you prefer other modes of communication, please let me know.
This is my comment about Pedro Domingos' thinking about AI alignment (in this video interview for the "Machine Learning Street Talk" channel),
also sprawling into adjacent themes (in which I push "my agenda", of course):
Thank you Tim for starting to consistently bring up the theme of "alignment" (although I disagree with this framing of the problem, quite similarly to Pedro actually, and for this reason prefer the term "AI safety"; more on this at the end of this comment) in your conversations with AI scientists who are not focused on this. This ...
Perhaps, these beliefs should be inferred from a large corpus of prompts such as "What are you?", "Who are you?", "How were you created?", "What is your function?", "What is your purpose?", etc., cross-validated with some ELK techniques and some interpretability techniques.
Whether these beliefs about oneself translate into self-awareness (self-evidencing) is a separate, albeit related question. As I wrote here:
In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during infer
I agree, it seems to me that training LLMs in a world virtually devoid of any knowledge of LLMs, in a walled garden where LLMs literally don't exist, will make their self-evidencing (goal-directedness) effectively zero. Of course, they cannot believe anything about the future LLMs (in particular, themselves) if they don't even possess such a concept in the first place.
You mean, by realising that there are online forums that are referenced elsewhere in the training corpus, yet themselves are conspicuously absent from the training data (which can be detected, say, as relative isolation of the features corresponding to these concepts, which means that the data around these concepts is purposefully removed from the training data)? And then these connections are added during fine-tuning when this forum data is finally added to the fine-tuning dataset? I still don't see how this will let the network know it's in training vs. deployment.
Simulation is not what I meant.
Humans are also simulators: we can role-play, and imagine ourselves (and act as) persons who we are not. Yet, humans also have identities. I specified very concretely what I mean by identity (a.k.a. self-awareness, self-evidencing, goal-directedness, and agency in the narrow sense) here:
In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during inferences. This has a rather precise interpretation in LLMs: different features, which correspond t
I said that the limit of agency is already proposed, from the physical perspective (FEP). And this limit is not EU maximisation. So, methodologically, you should either criticise this proposal, or suggest an alternative theory that is better, or take the proposal seriously.
If you take the proposal seriously (I do): the limit appears to be "uninteresting". A maximally entangled system is "nothing", it's perceptibly indistinguishable from its environment, for a third-person observer (let's say, in Tegmark's tripartite partition system-environment-observer). ...
You seem to try to bail out EU maximisation as the model because it is a limit of agency, in some sense. I don't think this is the case.
In classical and quantum derivations of the Free Energy Principle, it is shown that the limit is the perfect predictive capability of the agent's environment (or, more pedantically: in classic formulation, FEP is derived from basic statistical mechanics; in quantum formulation, it's more of being postulated, but it is shown that quantum FEP in the limit is equivalent to the Unitarity Principle). Also, Active Inference, the...
I don't think consequentialism is related to utility maximisation in the way you try to present it. There are many consequentialistic agent architectures that are explicitly not utility maximising, e. g. Active Inference, JEPA, ReduNets.
Then you seem to switch your response to discussing that consequentialism is important for reaching the far-superhuman AI level. This looks at least plausible to me, but first, these far-superhuman AIs could have a non-UM consequentialistic agent architecture (see above), and second, DragonGod didn't say that the risk is ne...
Agree with everything, including the crucial conclusion that thinking and writing about utility maximisation is counterproductive.
Just one minor thing that I disagree with in this post: while simulators as a mathematical abstraction are not agents, the physical systems that are simulators in our world, e. g. LLMs, are agents.
An attempt to answer the question in the title of this post, although that could be a rhetorical one:
I see the appeal. When I was writing the post, I even wanted to include a second call for action: exclude LW and AF from the training corpus. Then I realised the problem: the whole story of "making AI solve alignment for us" (which is currently in OpenAI's strategy: [Link] Why I’m optimistic about OpenAI’s alignment approach) depends on LLMs knowing all this ML and alignment stuff.
There are further possibilities: e. g., can we fine-tune a model, which is generally trained without LW and AF data (and other relevant data - as with your suggested filter) ...
Agreed. To be consistently "helpful, honest, and harmless", an LLM should somehow "keep this in the back of its mind" when it assists the person, or else it risks violating these desiderata.
In DNN LLMs, "keeping something in the back of the mind" is equivalent to activating the corresponding feature (of "HHH assistant", in this case) during most inferences, which is equivalent to self-awareness, self-evidencing, goal-directedness, and agency in a narrow sense (these are all synonyms). See my reply to nostalgebraist for more details.
The fact that large models interpret the "HHH Assistant" as such a character is interesting, but it doesn't imply that these models inevitably simulate such a character. Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.
In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during inferences. This has a rather precise interpretation in LLMs: different features, which corr...
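The "% of the time" score described above can be sketched as follows. This is only an illustration under strong assumptions: that a single "self" feature can be read out as one activation value per inference, and that a fixed threshold decides whether it counts as active. Both the threshold and the activation values are made up, and `self_awareness_score` is a hypothetical name.

```python
def self_awareness_score(self_feature_activations, threshold=0.5):
    """Fraction of inferences in which the hypothetical "self" feature is active."""
    if not self_feature_activations:
        return 0.0  # no inferences observed, so nothing to score
    active = sum(1 for a in self_feature_activations if a > threshold)
    return active / len(self_feature_activations)

# Made-up activations of the "self" feature, one value per inference.
toy_activations = [0.9, 0.1, 0.7, 0.0, 0.6]
score = self_awareness_score(toy_activations)  # 3 of the 5 values exceed 0.5
```

A score near 0 would correspond to a system that rarely references itself during inference, and a score near 1 to one that does so almost always. How to identify such a feature in a real model is left open here.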
To merge with agency?
Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like "it's relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training". (Most of the charts seem to have alarming trends starting around 10^10 parameters. ) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even
Probably this opinion of LWers is shaped by their experience communicating with outsiders. Almost all my attempts to communicate AI x-risk to outsiders, from family members to friends to random acquaintances, have not been understood for sure. Your experience (talking to random people at social events, walking away from you with the thought "AI x-risk is indeed a thing!", and starting to worry about it in the slightest afterwards) is highly surprising to me. Maybe there is a huge bias in this regard in the Bay Area, where even normal people generally under...
on those terms, most of the malicious intentions could be ruled out
Don't understand this, could you please elaborate?
"Physical" stands no chance against "informational" development because moving electrons and photons is so much more efficient than moving atoms.
The juiciest (and most terrible) realisation from the essay for me: because AI companies can move easily, it will be hard for regulators to press or restrict them because AI companies will threaten to move to other jurisdictions that don't care. And here, the economic (and, ultimately, power) competition between countries seriously undermines (if not wholly destroys) attempts at global coordination on the regulation of AI.
My takes on this scenario overall:
situational awareness (which enables the model to reason about its goals)
Terminological note: intuitively, situational awareness means understanding oneself existing inside a training process. The ability to reason about one's own goals would be more appropriately called "(reflective) goal awareness".
We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It's important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness aris
"Purely epistemic model" is not a thing, everything is an agent that is self-evidencing at least to some degree: https://www.lesswrong.com/posts/oSPhmfnMGgGrpe7ib/properties-of-current-ais-and-some-predictions-of-the. I agree, however, that RLHF actively strengthens goal-directedness (the synonym of self-evidencing) which may otherwise remain almost rudimentary in LLMs.