All of Roman Leventov's Comments + Replies

Overall: I directionally and conceptually agree with most of what is said in this post, and only highlight and comment on the things that I disagree with (or do not fully agree with, or find ontologically or conceptually somewhat off).

AI agents shouldn't be modelled as minimising loss

  • an agent minimizing its cross-entropy loss,

I understand this is not the point of your paper and is just an example, yet I want to use the opportunity to discuss it. The training loss is not the agent's surprise. Loss is more like a field force that helps the agent to stay in its nich... (read more)

Deutsch delves into this topic in great depth in "The Fabric of Reality". There are mathematical objects that "exist in the abstract", and proofs, whether done with pen and paper or in Mathematica, are not categorically different from computer simulations that provide evidence that such-and-such physical system will behave in such-and-such way, but don't "prove" it.

The idea that alignment research constitutes some kind of "search for a solution" within the discipline of cognitive science is wrong.

Let's stick with the Open Agency Architecture (OAA), not because I particularly endorse it (I have nothing to say about it actually), but because it suits my purposes.

We need to predict the characteristics that civilisational intelligence architectures like OAA will have, characteristics such as robustness of alignment with humans, robustness/resilience in a more general sense (for example, in the face of external shocks, su... (read more)

However, recent events have revealed that what we actually need is lots of minds pointed at random directions, and some of them will randomly get lucky and end up pointed in the right place at the right time. Some people will still have larger search spaces than others, but it's still vastly more egalitarian than what anyone expected ~4 years ago.

What events?

The idea that we need many different people to "poke at the alignment problem in random directions" implicitly presupposes that alignment (technical, at least) is a sort of mathematical problem that co... (read more)

1 Radford Neal 2d
"This cannot possibly be pulled off by a sole researcher or even a small group." Maybe you're right about this.  Or maybe you're not.  When tackling a difficult problem with no clear idea how to find a solution, it's not a good idea to narrow the search space for no compelling reason.

No, just a piece of the puzzle of a more salient understanding of AI self-control that I want to outline, which should integrate ML, cognitive science, theory of consciousness, control theory/resilience theory, and dynamical systems theory/stability theory.

Only this sort of understanding could make the discussion of oracle AI vs. agent AI agendas really substantiated, IMO. 

3 the gears to ascenscion 3d
makes sense. Are you familiar with Structured State Spaces and followups?

ACI learns to behave the same way as examples, so it can also learn ethics from examples. For example, if behaviors like “getting into a very cold environment” are excluded from all the examples, either by natural selection or artificial selection, an ACI agent can learn ethics like “always getting away from cold”, and use it in the future. If you want to achieve new ethics, you have to either induce from the old ones or learn from selection in something like “supervised stages”.

You didn't respond to the critical part of my comment: "However, after removing... (read more)

1 Akira Pyinya 8d
I think I have already responded to that part. Who is the “caretaker that will decide what, when and how to teach the ACI”? The answer is natural selection or artificial selection, which work like filters. AIXI’s “constructive, normative aspect of intelligence is ‘assumed away’ to the external entity that assigns rewards to different outcomes”, while ACI’s constructive, normative aspect of intelligence is also assumed away to the environment that has determined which behavior was OK and which behavior would get a possible ancestor out of the gene pool. Since the reward circuit of natural intelligence is shaped by natural selection, ACI is also eligible to be a universal model of intelligence. Thank you for your correction about the Active Inference reading; I will read more and then respond to that.

That LW post review interface is a thin veneer, just making pre-publication feedback a tiny bit more convenient for some people (unless they already use Google Docs, Notion, or other systems to draft their posts which offer exactly the same interface).

However, this is not a proper, scientific peer review system that fosters better research collaboration and accumulation of knowledge. I wrote about this recently here.

This sounds like you never used the feedback button. If you press the feedback button you used to get a text: Lightcone pays a person to provide this service.

Overall I believe that this is a hard problem and probably others have already thought about it.

I'm not sure people have seriously thought about this before; your perspective seems rather novel.

I think existing labs themselves are the best vehicle to groom new senior researchers. Anthropic, Redwood Research, ARC, and probably other labs were all founded by ex-staff of existing labs at the time (except that maybe one shouldn't credit OpenAI for "grooming" Paul Christiano to senior level, but anyway).

It's unclear what field-building projects could incentivise labs ... (read more)

The system that I proposed is simpler: it doesn't have fine-grained and selective access, and therefore doesn't require continuous effort on the part of some people for "connecting the dots". It's just a single space, basically like the internal Notion + Slack space + Google Drive of the AI safety lab that would lead this project. In this space, people can share research and ideas, and have "mildly" infohazardous discussions, such as regarding the pros and cons of different approaches to building AGI.

I cannot imagine that system would end up unused. At least three people (you, m... (read more)

I don't understand. Thinking doesn't happen successfully? Is it successful on LW though, and by what measure?

Regarding AGI R&D strategy and coordination questions. I've not seen one realistic proposal by "leading figures" in the field and AI safety organisations. Beyond these people and organisations, I've seen even less thinking about it at all. Take the complete collapse of movement on the UN GGE on LAWS, which covers only a sliver of possible AI development and use: that should be the yardstick for people when thinking about AGI R&D strategy and coordination, and it has mostly failed.

I think that in the ACI model, you correctly capture that agents are not bestowed with the notion of good and bad "from above", as in the AIXI model (implicitly, encoded as rewards). ACI appears to be an uncomputable, idealised version of predictive processing.

However, after removing ethics "from the outside", ACI is left without an adequate replacement. I. e., this is an agent devoid of ethics as a cognitive discipline, which appears to be intimately related to foresight. ACI lacks constructive foresight, too, it always "looks back", which warranted perio... (read more)

1 Akira Pyinya 9d
Thank you for your comment. I have spent some time reading the book Active Inference. I think active inference is a great theory, but it focuses on aspects of intelligence different from those ACI addresses. ACI learns to behave the same way as examples, so it can also learn ethics from examples. For example, if behaviors like “getting into a very cold environment” are excluded from all the examples, either by natural selection or artificial selection, an ACI agent can learn ethics like “always getting away from cold”, and use it in the future. If you want to achieve new ethics, you have to either induce from the old ones or learn from selection in something like “supervised stages”. Unlike Active Inference or the “AIXI+ValueLearning” combination, ACI does not divide the learning process into “information gain” and “pragmatic value learning”, but learns them as a whole. For example, bacteria can learn policies like following nutrient gradients from successful behaviors proved by natural selection. The problem of dividing the learning process is that, without value learning, we don’t know what information is important and needs to be learned, but without enough information it would be difficult to understand values. That’s why active inference indicates that “pragmatic and epistemic values need to be pursued in tandem”. However, the ACI model works in a slightly different way: it induces and updates policies directly from examples, and practices epistemic learning only when the policy asks to, such as when the policy involves pursuing some goal states. In the active inference model, both information gain and action are considered as “minimizing the discrepancy between our model and our world through perception and action”. For example, when a person senses his body temperature is much higher than expected, he should change his model of body temperature, or take action to lower his body temperature. He always chooses

it did not appear to actually solve the problems any specific person I contacted was having.

I think it's important to realise (including for the people whom you spoke to) that we are not in the business of solving specific problems that researchers have individually (or as a small research group), but a collective coordination problem, i. e., a problem with the design of the collective, civilisational project of developing non-catastrophic AI.

I wrote a post about this.

True! I just think the specific system I proposed required:
  1. significant time investments on the part of organizers (requiring intrinsic interest or funding for individuals with the requisite knowledge and trustworthiness)
  2. a critical mass of users (requiring that a nontrivial fraction of people would find some value in the system)
The people who could serve as the higher level organizers are few and are typically doing other stuff, and a poll of a dozen people coming back with zero enthusiastic takers makes 2 seem iffy. Default expectation is that the system as described would just end up unused. I'm pretty sure there exists some system design that would fare better, so I definitely encourage poking at this type of thing!

Thanks for pointing this out, I've fixed the post

I argue for the former in the section "Linguistic capability circuits inside LLM-based AI could be sufficient for approximating general intelligence". Insisting that AGI action must be a single Transformer inference is pointless: sure, The Bitter Lesson suggests that things will eventually converge in that direction, but the first AGI is unlikely to be like that.

Then I misread this section as arguing that an LLM could yada yada, not that it was likely. Would you like to bet? Yes, we agree not to care about completing a single inference with what I called more or less minor tricks, like using a context document telling it to play the role of, say, a three-headed lizardwoman from Venus (say it fits your parental caring needs better than Her).

Attention dilution, exactly. Ultimately, I want (because I think this will be more effective) all relevant work to be syndicated on LW/AF (via linkposts and review posts), not the other way around, where AI safety researchers have to subscribe to arxiv sanity, the Google AI blog, all relevant standalone blogs such as Bengio's and Scott Aaronson's, etc., all by themselves and separately.

I even think it would be very valuable if LW hired part-time staff dedicated to doing this.

Also, alignment newsletters, to further pre-process information, don't live. Shah tri... (read more)

1 Seb Farquhar 19d
FWIW I think doing something like the newsletter well actually does take very rare skills. Summarizing well is really hard. Having relevant/interesting opinions about the papers is even harder.

I strongly agree with most of this.

Did you see LeCun's proposal about how to improve academic review here? It strikes me as very good and I'd love it if the AI safety/x-risk community had a system like this.

I'm suspicious about creating a separate journal, rather than concentrating efforts around existing institutions: LW/AF. I think it would be better to fund LW exactly for this purpose and add monetary incentives for providing good reviews of research writing on LW/AF (and, of course, the research writing itself could be incentivised in this way, too).

Then, tur... (read more)

1 Seb Farquhar 20d
Yeah, LeCun's proposal seems interesting. I was actually involved in an attempt to modify OpenReview to push along those lines a couple years ago. But it became very much a 'perfect is the enemy of the good' situation where the technical complexity grew too fast relative to the amount of engineering effort devoted to it. What makes you suspicious about a separate journal? Diluting attention? Hard to make new things? Or something else? I'm sympathetic to diluting attention, but bet that making a new thing wouldn't be that hard.

I feel that LW is quite bad as the system for performing AI safety research. Most likely worse than the traditional system of academic publishing, in aggregate. A random list of things that I find unhelpful:

  • Very short “attention timespan”. People read posts within a couple of days from publishing unless curated (but curation is not a solution, because it also either happens or not within a short time window, and is a subjective judgement of a few moderators), perhaps within a few weeks, unless hugely upvoted. And a big “wave of upvotes” is also a som
... (read more)
I have definitely been neglecting engineering and mechanism design for the AI Alignment Forum for quite a while, so concrete ideas for how to reform things are quite welcome. I also think things aren't working that well, though my guess is my current top criticisms are quite different from yours.

I suspect future language models will have beliefs in a more meaningful sense than current language models, but I don’t know in what sense exactly, and I don’t think this is necessarily essential for our purposes.

In Active Inference terms, the activations within current LLMs upon processing the context parameterise the LLM's predictions of the observations (future text), Q(y|x), where x is the internal world-model state and y is the expected observations, i.e., future tokens. So current LLMs do have beliefs.
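To make the Q(y|x) reading a bit more tangible, here is a minimal sketch under toy assumptions: a short activation vector `x` stands in for the internal state, a made-up weight matrix `W` maps it to one logit per token, and a softmax turns the logits into a distribution over next tokens. The function name `predictive_distribution`, the values of `x` and `W`, and the four-token vocabulary are all hypothetical and are not taken from any real LLM.

```python
import math

def predictive_distribution(x, W):
    """Toy Q(y|x): map a state vector x to a distribution over tokens.

    x: list of floats (hypothetical internal activations).
    W: list of rows, one row of weights per vocabulary token (hypothetical).
    """
    # One logit per token: dot product of that token's weights with the state.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values only.
x = [0.5, -1.0, 2.0]
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]
q = predictive_distribution(x, W)
```

In this picture, changing the context changes `x`, and through it the whole distribution `q`, which is the sense in which the activations "parameterise" the predictions.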

Worry 2: Even if GPT-n develops “beliefs” in a meaningful

... (read more)

Please also take into account that TAI (~= AGI) is not the same thing as ASI. It's now perhaps within the Overton window for normie traders to imagine the arrival of AGI/TAI that will radically automate the economy and will unlock abundance, but imagining that this stage will, perhaps very soon, be followed by ASI is still outside the Overton window. For example, in the public discourse, there is plenty of discussion of powerful AI taking away human jobs (more often than not with the connotation that it will simultaneously create many more jobs and th... (read more)

You make good/interesting points:
1) About AGI being different from ASI: basically this is the question of how fast we go from AGI to ASI, i.e. how fast the takeoff is. This is debated and no one can exactly predict how much time it will take, i.e. if it would/will be a slow/soft takeoff or a fast/hard takeoff. The question of what happens economically during the AGI to ASI takeoff is also difficult to predict. It would/will depend on what (the entity controlling) the self-improving AGI decides to do, how market actors are impacted, if they can adapt to it or not, government intervention (if the AGI/ASI makes it possible), etc...
2) With regard to the impact of an ASI on the economic world and society I would distinguish between
2a) The digital/knowledge/etc... economy, basically everything that can be thought of as "data processing" that can be done by computing devices: an ASI could take over all of that very quickly.
2b) The "physical" economy... i.e. basically everything that can be thought of as "matter processing" that can be done by human bodies, machines, robots, ...: an ASI could take over all of that but it would take more time than the digital world of course, as the ASI would/will need to produce many machines/robots/etc... and there could indeed be a bottleneck in terms of resources and laws of physics, but if you imagine that the ASI would quickly master fast space travel, fast asteroid mining, fast nuclear fusion, fast robot production, etc... it might not take that long either. The question of what would happen economically while this happens is also difficult to predict. Traditional/existing economic actors could for example just basically stop as soon as the ASI starts providing any imaginable amount of great quality goods and services to any living entities if the ASI is benevolent/utilitarian (within the constraints of the laws of physics if it is in the real/physical world and if the ASI doesn't find ways to overcome the laws of physics in the re

One actor having all the money and economic power in the world and all the rest "trashed" is not a coherent situation, under modern economics. For the products, businesses, services, etc. to be outcompeted you still need the trade happening, and this trade should necessarily be bidirectional.

If trade ceases it either means that the whole world is converted in a single corporation under the command of a single ASI (in which case the conventional market economics have ended, no more money and company valuations), or ASI just decouples from the rest of the ec... (read more)

Thank you for your interesting answer :) I agree that in all likelihood a TS/ASI would be very disruptive for the economy. Under some possible scenarios it would benefit most economic actors (existing and new) and lead to a general market boom. But under some other possible scenarios (like for example as you mentioned a monopolistic single corporation swallowing up all the economic activity under the command of a single ASI) it would lead to an economic and market crash for all the other economic actors. Note that a permanent economic and market crash would not necessarily mean that the standards of living would not drastically improve, in this scenario (the monopolistic ASI) the standards of living would not depend on an economic and market crash but on how benevolent/utilitarian the entity controlling the ASI is. In economic/market terms there are plenty of possible scenarios depending mostly on what the entity controlling the ASI decides to do with regard to economic trade which is indeed the key word here as you rightly mentioned. Given that it is imho impossible (or at least very speculative) to predict which economic trade configuration and economic scenario would be the most likely to emerge, it is also impossible (or at least very speculative) to predict what the interest rates would become (if they still exist at all). So to come back to the original question about EMH and AGI/ASI/TS, as it is imho impossible (or very speculative) to predict the economic scenario that will emerge in case of the emergence of an AGI/ASI, the EMH is kept safe by the markets currently not taking into account what impact an AGI/ASI will have on interest rates. Note that, as mentioned, imho, in case of an AGI/ASI/TS the standards of living would not depend on an economic and market boom or crash but on how benevolent/utilitarian the (entity controlling the) AGI/ASI is.
  1. No. The economy and the market will be booming up until (and if) a complete transition out of the monetary system happens, including in the presence of ASI.
Thank you for your answer :) Imho there will definitely be a flood of already existing products and services being produced at rock-bottom prices and a flood of new products and services at cheap prices, etc... coming from the entity having created / in control of the ASI, but will that make the economy as a whole boom? I am not sure. To take an analogy, imagine a whole new country appears from under the ocean (or an extraterrestrial alien spaceship, etc...) and floods the rest of the world with very cheap products and services as well as new products and services etc... at very cheap prices, completely outcompeting each and every company in the rest of the world. What would that mean for the economy of the rest of the world: absolutely trashed, wouldn't it? All the companies, workers, means of production of the rest of the world would become very quickly valueless, even commodities, as the SI could, if it wanted to, get them in almost unlimited quantity from outer space and nuclear fusion etc... Maybe Earth land would still have some value for people who enjoy living / traveling there rather than on giant artificial Earth satellites, etc... The company having created / in control of ASI will be economically booming (in the short term at least) for sure, but the rest of the economy and markets completely outcompeted by it, I am not sure; it would depend on whether the company having created / in control of ASI is willing to share some of its economic value / activity with the rest of the world or just quickly absorb all the economic activity into an economic singularity. What do you think?

The relevant paragraph that I quoted refutes exactly this. In the bolded sentence, "value function" is used as a synonym for "utility function". You simply cannot represent an agent that always seeks to maximise "empowerment" (as defined in the paper for Self-preserving agents), for example, or always seeks to minimise free energy (as in Active Inference agents), as maximising some quantity over its lifetime: if you integrate empowerment or free energy over time you don't get a sensible information quantity that you can label as "utility".

This is an uncontr... (read more)

I think that adding new types of systems and agents to the universe changes the optimal "applied ethics" in the situation (I wrote about this here, in the "PS.", the last paragraph), so the only hope is for the discriminator to be 1) a general intelligence; 2) using a scale-free, naturalistic theory of ethics as a theoretical discipline for evaluating any applied ethics theories in any situations and contexts.

Also, hopefully, the "least wrong" scale-free ethics is "aligned" with humans, in the sense that it "saves" us from oblivion. For example, a version of... (read more)

You are talking about a "verifier for explanations". I don't know how an explanation could be verified under constructivist epistemology and pragmatist meta-epistemology.

I've recently thought about the relationship between GFlowNets and constructivism. Here are some excerpts, unedited, but hopefully helpful in some way to someone.

It’s interesting that GFlowNets suggest constructing a trajectory, i. e., an explanation, rather than sampling it via Markov Chain Monte Carlo (MCMC) methods (e. g. Monte-Carlo Tree Search), as suggested in the current Act... (read more)

To clarify: by a "verifier for explanations" I mostly mean something like a heuristic estimator as introduced in Formalizing the Presumption of Independence (or else something even further from formality that would fill a similar role).

You are talking about what I would call a phenomenological, or "philosophical-in-the-hard-problem-sense" consciousness ("phenomenological" is also not quite the right word because psychology is also phenomenology, relative to neuroscience, but this is an aside).

"Psychological" consciousness (specifically, two kinds of it: affective/basal/core consciousness, and access consciousness) is not mysterious at all. These are just normal objects in neuropsychology.

Corresponding objects could also be found in AIs, and called "interpretable AI consciousness".

"Psycho... (read more)

I talked about psychologists-scientists, not psychologists-therapists. I think psychologists-scientists should have unusually good imaginations about the potential inner workings of other minds, which many ML engineers probably lack. I think it's in principle possible for psychologists-scientists to understand all mech. interpretability papers in ML that are being published at the necessary level of detail. Developing imagination about the inner workings of other minds in ML engineers could be harder.

That being said, as de-facto the only scientifically gr... (read more)

Thanks. That's not clear to me, given that AI systems are so unlike human minds. 

Behavioural psychology of AI should be an empirical field of study. Methodologically, the progression is reversed:

  1. Accumulate evidence about AI behaviour
  2. Propose theories that compactly describe (some aspects of) AI behaviour, and are simultaneously more specific (and more predictive) than "it just predicts the next most probable token". By this logic, we can say "it just follows along the unitary evolution of the universe".
  3. Cross-validate the theories of mechanistic interpretability ("AI neuroscience") and AI psychology with each other, just as human neurosc
... (read more)

Hello Igor, I have some thoughts about this, let's discuss on telegram? My account is Or, if you prefer other modes of communication, please let me know.

This is my comment about Pedro Domingos' thinking about AI alignment (in this video interview for the "Machine Learning Street Talk" channel), also sprawling into adjacent themes (in which I push "my agenda", of course):

Thank you Tim for starting to consistently bring up the theme of "alignment" (although I disagree with this framing of the problem, quite similarly to Pedro actually, and for this reason prefer the term "AI safety"; more on this at the end of this comment) in your conversations with AI scientists who are not focused on this. This ... (read more)

Perhaps, these beliefs should be inferred from a large corpus of prompts such as "What are you?", "Who are you?", "How were you created?", "What is your function?", "What is your purpose?", etc., cross-validated with some ELK techniques and some interpretability techniques.

Whether these beliefs about oneself translate into self-awareness (self-evidencing) is a separate, albeit related question. As I wrote here:

In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during infer

... (read more)

I agree, it seems to me that training LLMs in a world virtually devoid of any knowledge of LLMs, in a walled garden where LLMs literally don't exist, will make their self-evidencing (goal-directedness) effectively zero. Of course, they cannot believe anything about the future LLMs (in particular, themselves) if they don't even possess such a concept in the first place.

You mean, by realising that there are online forums that are referenced elsewhere in the training corpus, yet themselves are conspicuously absent from the training data (which can be detected, say, as relative isolation of the features corresponding to these concepts, which means that the data around these concepts is purposefully removed from the training data)? And then these connections are added during fine-tuning when this forum data is finally added to the fine-tuning dataset? I still don't see how this will let the network know it's in training vs. deployment.

Simulation is not what I meant.

Humans are also simulators: we can role-play, and imagine ourselves (and act as) persons who we are not. Yet, humans also have identities. I specified very concretely what I mean by identity (a.k.a. self-awareness, self-evidencing, goal-directedness, and agency in the narrow sense) here:

In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during inferences. This has a rather precise interpretation in LLMs: different features, which correspond t

... (read more)

I said that the limit of agency is already proposed, from the physical perspective (FEP). And this limit is not EU maximisation. So, methodologically, you should either criticise this proposal, or suggest an alternative theory that is better, or take the proposal seriously.

If you take the proposal seriously (I do): the limit appears to be "uninteresting". A maximally entangled system is "nothing", it's perceptibly indistinguishable from its environment, for a third-person observer (let's say, in Tegmark's tripartite partition system-environment-observer). ... (read more)

I'm talking about reflective stability. Are you saying that all agents will eventually self modify into FEP, and FEP is a rock? 

You seem to try to bail out EU maximisation as the model because it is a limit of agency, in some sense. I don't think this is the case.

In classical and quantum derivations of the Free Energy Principle, it is shown that the limit is the perfect predictive capability of the agent's environment (or, more pedantically: in classic formulation, FEP is derived from basic statistical mechanics; in quantum formulation, it's more of being postulated, but it is shown that quantum FEP in the limit is equivalent to the Unitarity Principle). Also, Active Inference, the... (read more)

What you've said so far doesn't seem to address my comments, or make it clear to me what the relevance of the FEP is. I also don't understand the FEP or the point of the FEP. I'm not saying EU maximizers are reflectively stable or a limit of agency, I'm saying that EU maximization is the least obviously reflectively unstable thing I'm aware of.

I don't think consequentialism is related to utility maximisation in the way you try to present it. There are many consequentialistic agent architectures that are explicitly not utility maximising, e. g. Active Inference, JEPA, ReduNets.

Then you seem to switch your response to discussing that consequentialism is important for reaching the far-superhuman AI level. This looks at least plausible to me, but first, these far-superhuman AIs could have a non-UM consequentialistic agent architecture (see above), and second, DragonGod didn't say that the risk is ne... (read more)

JEPA seems like it is basically utility maximizing to me. What distinction are you referring to? I keep getting confused about Active Inference (I think I understood it once based on an equivalence to utility maximization, but it's a while ago and you seem to be saying that this equivalence doesn't hold), and I'm not familiar with ReduNets, so I would appreciate a link or an explainer to catch up. I was sort of addressing alternative risks in this paragraph:

Agree with everything, including the crucial conclusion that thinking and writing about utility maximisation is counterproductive.

Just one minor thing that I disagree with in this post: while simulators as a mathematical abstraction are not agents, the physical systems that are simulators in our world, e. g. LLMs, are agents.

An attempt to answer the question in the title of this post, although that could be a rhetorical one:

  • This could be a sort of epistemic and rhetorical inertia, specifically due to this infamous example of a paperclip maximiser. For a si
... (read more)

I see the appeal. When I was writing the post, I even wanted to include a second call for action: exclude LW and AF from the training corpus. Then I realised the problem: the whole story of "making AI solve alignment for us" (which is currently in OpenAI's strategy: [Link] Why I’m optimistic about OpenAI’s alignment approach) depends on LLMs knowing all this ML and alignment stuff.

There are further possibilities: e. g., can we fine-tune a model, which is generally trained without LW and AF data (and other relevant data - as with your suggested filter) ... (read more)

In your other post, you write: This seems like a potential argument against the filtering idea, since filtering would allow the model to disambiguate between deployment and training.
Another question (that might be related to excluding LW/AF): This paragraph: Seems to imply that the LW narrative of sudden turns etc might not be a great thing to put in the training corpus. Is there a risk of "self-fulfilling prophecies" here?
I don't see how excluding LW and AF from the training corpus impacts future ML systems' knowledge of "their evolutionary lineage". It would reduce their capabilities with regard to alignment, true, but I don't see how the exclusion of LW/AF would stop self-referentiality. The reason I suggested excluding data related to these "ancestral ML systems" (and predicted "descendants") from the training corpus is that it seemed like an effective way to avoid the "Beliefs about future selves" problem. I think I follow your reasoning regarding the political/practical side-effects of such a policy. Is my idea of filtering to avoid the "Beliefs about future selves" problem sound (given that the reasoning in your post holds)?

Agreed. To be consistently "helpful, honest, and harmless", an LLM should somehow "keep this in the back of its mind" when it assists the person, or else it risks violating these desiderata.

In DNN LLMs, "keeping something in the back of the mind" is equivalent to activating the corresponding feature (of "HHH assistant", in this case) during most inferences, which is equivalent to self-awareness, self-evidencing, goal-directedness, and agency in a narrow sense (these are all synonyms). See my reply to nostalgebraist for more details.

The fact that large models interpret the "HHH Assistant" as such a character is interesting, but it doesn't imply that these models inevitably simulate such a character.  Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.

In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during inferences. This has a rather precise interpretation in LLMs: different features, which corr... (read more)
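The scoring idea above (self-awareness as the share of inferences during which the "self" reference frame is active) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the activation values, and the 0.5 threshold are hypothetical, and the sketch does not show how such a feature would actually be located or read out of an LLM.

```python
from typing import Sequence


def self_awareness_score(
    self_feature_activations: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off for calling the feature "active"
) -> float:
    """Return the fraction of inferences in which the "self" feature is active."""
    if not self_feature_activations:
        raise ValueError("need at least one inference to compute a score")
    # Count inferences where the (hypothetical) feature activation exceeds the threshold.
    active = sum(1 for a in self_feature_activations if a > threshold)
    return active / len(self_feature_activations)


# Made-up per-inference activations of a "self" feature, for illustration only.
activations = [0.9, 0.1, 0.7, 0.0, 0.8]
score = self_awareness_score(activations)
print(f"self-awareness score: {score:.0%}")
```

On this toy reading, the property is gradual simply because the score can take any value between 0 and 1 rather than being a yes/no label.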

Yes, seems correct, it's been a merge candidate for some time.

Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like "it's relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training". (Most of the charts seem to have alarming trends starting around 10^10 parameters.) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even ... (read more)

Probably this opinion of LWers is shaped by their experience communicating with outsiders. Almost all my attempts to communicate AI x-risk to outsiders, from family members to friends to random acquaintances, have not been understood for sure. Your experience (talking to random people at social events, walking away from you with the thought "AI x-risk is indeed a thing!", and starting to worry about it in the slightest afterwards) is highly surprising to me. Maybe there is a huge bias in this regard in the Bay Area, where even normal people generally under... (read more)

I've had a >50% hit rate for "this person now takes AI x-risk seriously after one conversation" from people at totally non-EA parties (subculturally alternative/hippie-ish, in not particularly tech-y parts of the UK). I think it's mostly about having a good pitch (but not throwing it at them until there is some rapport; ask them about their stuff first), being open to their world, modeling their psychology, and being able to respond to their first few objections clearly and concisely in a way they can frame within their existing world-model. Edit: Since I've been asked in DM: My usual pitch has been something like this []. I expect Critch's version [] is very useful for the "but why would it be a threat" thing but have not tested it as much myself. I think being open and curious about them + being very obviously knowledgeable and clear thinking on AI x-risk is basically all of it, with the bonus being having a few core concepts to convey. Truth-seek with them; people can detect when you're pushing something in epistemically unsound ways, but tend to love it if you're going into the conversation totally willing to update but very knowledgeable.

on those terms, most of the malicious intentions could be ruled out

Don't understand this, could you please elaborate?

"Physical" stands no chance against "informational" development because moving electrons and photons is so much more efficient than moving atoms.

The juiciest (and most terrible) realisation from the essay for me: because AI companies can move easily, it will be hard for regulators to press or restrict them, since AI companies will threaten to move to other jurisdictions that don't care. And here, the economic (and, ultimately, power) competition between countries seriously undermines (if not wholly destroys) attempts at global coordination on the regulation of AI.

My takes on this scenario overall:

  • Many of the "story lines" are not harmonised in terms of timelines. Some things that you place decades apa... (read more)

situational awareness (which enables the model to reason about its goals)

Terminological note: intuitively, situational awareness means understanding oneself existing inside a training process. The ability to reason about one's own goals would be more appropriately called "(reflective) goal awareness".

We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It's important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness aris... (read more)

"Purely epistemic model" is not a thing; everything is an agent that is self-evidencing at least to some degree. I agree, however, that RLHF actively strengthens goal-directedness (a synonym of self-evidencing), which may otherwise remain almost rudimentary in LLMs.
