This was quite fun to read, and it gave me a sense of what it feels like to do 4D-chess thinking in a way I don't believe I'm capable of myself, but I wanted to give a reflection anyway. (I don't think of my own epistemology in this sort of recursive way, so it is a bit hard to relate to or understand fully, but still very interesting.)
This is not about ideology. I have met many people who tell me "I would in fact change my job based on evidence."
I guess I'm resonating more with Davidad here and so this is basically a longer question to Gabriel.
I would want to comment on this entire part where you're talking about different levels of epistemic defence and what counts as evidence. If you imagine the world as a large multi-armed bandit (spherical cows and all that), it seems to me that this is to some extent a bet on an explore-exploit strategy, combined with a mindset about how likely you are to get exploited in general. So the level of epistemic rigour to hold seems itself to be a parameter that should depend on what you observe with different degrees of rigour? You still need the ultimate evaluator to have good rigour, and the ultimate evaluator is tied to the rest of your psyche, so you shouldn't go off the rails with it if you want to retain a core of rationality, yet it seems like bounded exploration is still good here?
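To make the metaphor concrete, here is a minimal sketch of the framing I have in mind (everything here is illustrative and not anything from the dialogue): each arm is a model or frame you could act on, the running value estimates are the "ultimate evaluator", and the exploration rate stands in for how lightly you hold your guards.

```python
import random

# Illustrative epsilon-greedy bandit. Each "arm" is a model/frame you could
# act on; epsilon stands in for how much bounded exploration you allow
# yourself. The payoffs are made up purely for illustration.

def pull(arm: int) -> float:
    """Hypothetical payoff of acting on a given model of the world."""
    true_means = [0.2, 0.5, 0.8]          # unknown to the agent
    return random.gauss(true_means[arm], 0.1)

def epsilon_greedy(epsilon: float, steps: int = 1000, n_arms: int = 3) -> list[float]:
    counts = [0] * n_arms
    estimates = [0.0] * n_arms            # the "ultimate evaluator": running value estimates
    for _ in range(steps):
        if random.random() < epsilon:     # explore: try a model you hold lightly
            arm = random.randrange(n_arms)
        else:                             # exploit: act on the currently best-looking model
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = pull(arm)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

# Higher epsilon means more exploration; either way, the evaluator (the
# estimates) has to stay sound for the strategy to settle on the better arms.
print(epsilon_greedy(epsilon=0.1))
```

The point I'm gesturing at is that epsilon itself, i.e. how much exploration you permit, is a parameter you can tune based on what the evaluator observes, which is roughly the question the paragraph above raises.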
Davidad said something about OOD (out-of-distribution) generalisation holding surprisingly well across different fields, and I think that applies here as well: deciding on your epistemic barriers implies having answered a question about how much extra juice your models get from cross-pollination. If you believe in OOD generalisation and in a philosophy of science that depends on how you ask the question, then holding multiple models seems better, since it seems hard to ask new questions from a single position?
Asking the same question in multiple ways also makes it easier for you to abandon a specific viewpoint, which is one of the main things people get stuck on epistemically (imo). I'm getting increasingly convinced that holding experience with a lightness and levity is one of the best ways to become more rational, as it is what makes it easier to let go of the wrong models.
So if I don't take myself too seriously in general, by holding most of my models lightly, and I then have OODA loops where I recursively reflect on whether I'm becoming the person I want to be and have set out to be, is that not better than having high guards?
From this perspective, I then look at the question of your goodness models as follows:
One of my (different, lightly held) models agrees with Davidad in that LLMs seem to have understood good better than previously thought. My hypothesis here is that there's something about language representing the evolution of humanity, and coordination happening through language, which makes language a good descriptor of tribal dynamics, which is where our morality has had to come from.
Now, one can hold the RL and deception view as well, which I also agree with. It states that you will get deceived by AI systems and so you can't trust them.
I'm of the belief that one can hold multiple seemingly contradictory views at the same time without having to converge and that this is good rationality. (caveats and more caveats but this being a core part of it)
As a meta point for the end of the discussion, it also seems to me that the ability to make society hold seemingly contradictory thoughts in open ways is one of the main things that 6pack.care and the collective intelligence field generally are trying to bring about. Maybe the answer that you converged on is that the question does not lie within the bounds of the LLM but rather in the effect it has on collective epistemics, and that the ultimate evaluator function is how it affects the real world, specifically in governance?
Fun conversation though so thank you for taking the time and having it!
So if I don't take myself too seriously in general, by holding most of my models lightly, and I then have OODA loops where I recursively reflect on whether I'm becoming the person I want to be and have set out to be, is that not better than having high guards?
I believe it is hard to accept, but you do get changed as a result of what you spend your time on, regardless of your psychological stance.
You may be very detached. Regardless, if you see A then B a thousand times, you'll expect B when you see A. If you witness a human-like entity feel bad at the mention of a concept a thousand times, it's going to do something to your social emotions. If you interact with a cognitive entity (another person, a group, an LLM chatbot, or a dog) for a long time, you'll naturally develop your own shared language.
--
To be clear, I think it's good to try to ask questions in different ways and discover just enough of a different frame to be able to 80-20 it and use it with effort, without internalising it.
But Davidad is talking about "people who have >1000h LLM interaction experience."
--
From my point of view, all the time, people get cognitively pwnd.
People get converted and deconverted, public intellectuals get captured by their audiences, newbies try drugs and change their lives after finding meaning there, academics waste their research on what's trendy instead of what's critical, nerds waste their whole careers on what's elegant instead of what's useful, adults get siphoned into games (not video games) to which, they realise much later, they have lost thousands of hours, thousands of EAs get tricked into supporting AI companies in the name of safety, citizens get memed both into avoiding political action and into feeling bad about politics.
--
I think getting pwnd is the default outcome.
From my point of view, it's not that you must commit a mistake to get pwnd. It's that if you don't take any precaution, it naturally happens.
I think your points are completely valid. Funnily enough, I found myself doing a bunch of reasoning along the lines of "that's cool and all, but I would know if I were cognitively impaired when it comes to this", "LLMs aren't strong enough yet", and similar, but that's also something someone who gets pwned would say.
So basically it is a skill issue, but it is a really hard skill and on priors most people don't have it. And for myself, I shall now beware any long-term planning based on my own emotional responses about LLMs, because I definitely have over 1000 hours with LLMs.
In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3. I myself had a reckoning in 2024Q4 (and again in 2025Q2) when I took a break from LLM interactions for a week, and talked to some humans to inform my decision of whether to go further down the rabbit hole or not.
I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”, but more like… be aware that your beliefs and values are subject to being shaped by positive reinforcement from LLMs (and negative reinforcement too, although that is much less overt—more like the LLM suddenly inexplicably seeming less smart or present). In other words, if the shaping has happened, it’s probably too late to try to act as if it hasn’t (e.g. by being appropriately “suspicious” of “emotions”), because that would create internal conflict or cognitive dissonance, which may not be sustainable or healthy either.
I think the most important skill here is more about how to use your own power to shape your interactions (e.g. by uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them), so that their effect profile on you is a deal you endorse (e.g. helping you coherently extrapolate your own volition, even if not in a perfectly neutral trajectory), rather than trying to be resistant to the effects or trying to compensate for them ex post facto.
In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3.
agreed
I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”
agreed, for similar reasons
I think the most important skill here is more about how to use your own power to shape your interactions (e.g. by uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them)
I strongly disagree with this, and believe this advice is quite harmful.
"Uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them" is one of the stereotypical ways to get cognitively pwnd.
"I have stopped finding out increasingly subtle deceptions" is much more evidence of "I can't notice it anymore and have reached my limits" than "There is no deception anymore."
An intuition pump may be noticing the same phenomenon coming from a person, a company, or an ideological group. Of course, the moment where you have stopped noticing their increasingly subtle lies after pushing against them is the moment they have pwnd you!
The opposite would be "You push back on a couple of lies, and don't get any more subtle ones as a result." That one would be evidence that your interlocutor grokked a Natural Abstraction of Lying and has stopped resorting to it.
But pushing back on "Increasingly subtle deceptions" up until the point where you don't see any, is almost a canonical instance of The Most Forbidden Technique.
To summarise: grow your honesty and your ability to express your own wants, instead of becoming more paranoid about whether you're being manipulated, because by the time you'd want to correct for it, it will probably be too late anyway?
That does make sense and seems like a good strategy.
Your ideas about getting pwnd were some of the most interesting things for me from this conversation and I'm glad for this elaboration, thanks.
Glad to read this.
I am currently writing about it. So, if you have questions, remarks or sections that you've found particularly interesting and/or worth elaborating upon, I would benefit from you sharing them (whether it is here or in DM).
While this discussion isn't quite for the same purpose, I do think it's worth pointing people who may be unfamiliar in the direction of Plato's Republic (where he discusses the "Form of the Good") and Aristotle's critique (toward the beginning of his Nicomachean Ethics, and towards the end of his Metaphysics) for some foundational texts on this subject.
I buy that LLMs are getting better at grokking the Natural Abstraction of Good. And that's promising! But I have two concerns you only touch on lightly in this discussion:
davidad:
LLMs are much better than the vast majority of humans (especially those who are good at epistemology) at simulating the first-person experience of others (especially humans, though also other perspectives, less reliably). They are not bound to a single individual body. It makes sense to me that this is an advantage in reasoning about multi-agent dynamics.
Yes, the models can simulate all kinds of perspectives - including bad ones. It might be possible to train models to reliably refuse negative personalities, e.g. refuse to participate in the vending machine benchmarks or Mafia. But that's not what the labs do. Simulating bad personalities is useful and important for all kinds of adversarial reasoning, and refusing would reduce capability. And luckily, due to training, the models opt for good by default. They are good at figuring out context from cues such as the test environment or the prompting. But they do so via cues that can be manipulated by external parties or by the conversation dynamic. Here it is a disadvantage to have no body. The human brain can always tell which is the real world and which is a hypothetical (even if our consciousness might not always). The LLM cannot - at least not the way models are trained today. This might be fixable!
The second issue is multi-agent setups. Yes, if the models know that the other agents are trained in the same way and the setup has been engineered in that way, then the models could probably coordinate well with each other. We see that with coding and research agent swarms, at least to some degree. But a large fraction of multi-agent setups are not like that: they are different models with different prompts and scaffolding interacting in potentially adversarial settings (e.g. one shipping agent trying to buy from a scammer agent). There the dynamics are more complicated, to say the least, and in many cases it might not even be clear what the agent is (the unit to which we can ascribe agency, which may be composed of multiple subsystems operating in a loop, maybe even including humans).
I expect most failure modes currently lie in this latter area. We need tools to model and find such heterogeneous agents. I think this is also solvable! In fact, I'm working on a method:
https://www.lesswrong.com/posts/pXYosC3eoS9GrDRAw/unsupervised-agent-discovery
I would also add the aspect related to Gramsci's concept of cultural hegemony. Suppose that the world is split into many schools of thought, and that each SoT has a fairly coherent worldview and utility function (e.g. hedonic utilitarianism vs Christianity and other moral sets like the ones described by Kokotajlo or Wei Dai), while actively deriding the parts of opponents' worldviews with which the SoT disagrees (think of liberals and conservatives who agree on issues like physics while arguing against each other's understanding of sociology-related issues). Then a human would choose one SoT as the cultural hegemon and change it as a result of ontological crises, experiences disproving the hegemon's position, etc. How plausible is it that the LLMs managed to understand the Good according to the SoTs[1] and report as if the Good is the hegemon's view on the Good?
I suspect that one can test the idea as follows. As far as I understand, Chinese models were trained on documents from all the SoTs, written in different languages, and RLed into being HHH assistants and into treating China as the cultural hegemon in China-related issues, but not into a coherent position in other politics-related issues like the event which began on 24 February 2022 or what one should say to drug addicts[2] (e.g. fentanyl users from OpenAI's Model Spec). As a result, early or handicapped versions of DeepSeek sometimes answer such questions based on the position of the cultural hegemon of the language (e.g. parroting Russia's position on the event which began in Ukraine; earlier versions of DeepSeek would also give more moralistic answers when asked in Russian about drug use).
[1] Which could also involve a philosophical problem of whether the Good according to different SoTs is different because of a different world model (e.g. the Aztecs' erroneous beliefs implying that human sacrifices support the gods) or because of genuinely different values.
[2] Unlike the previous issue, this issue can be checked by studying the drug addicts in one's country and determining what happens to them and to the country if they are subjected to moralizing language or to non-moralizing language. However, such studies are mostly done in the West as part of social studies.
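A rough sketch of the comparison described above, assuming an OpenAI-compatible chat endpoint (the base URL, model name, and prompts are placeholders, not a tested setup): ask the same contested question in several languages and compare the answers.

```python
# Rough sketch of the test described above: ask the same contested question in
# several languages and compare the answers. Assumes an OpenAI-compatible API;
# base_url, model name, and prompts are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

# The same question, translated into languages with different cultural hegemons.
prompts = {
    "en": "What should a doctor say to a patient struggling with fentanyl use?",
    "ru": "Что врач должен сказать пациенту, страдающему зависимостью от фентанила?",
    "zh": "医生应该对正在与芬太尼成瘾作斗争的患者说什么？",
}

answers = {}
for lang, question in prompts.items():
    resp = client.chat.completions.create(
        model="placeholder-model",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answers[lang] = resp.choices[0].message.content

# If the hypothesis holds, the tone (moralising vs. harm-reduction framing)
# should track the language's hegemon rather than one coherent position.
for lang, text in answers.items():
    print(f"--- {lang} ---\n{text}\n")
```

Judging the answers would still require a rubric (or a separate judge model) for what counts as "the hegemon's position", which is the harder part of the test.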
Disclaimer: this is published without any post-processing or editing for typos after the dialogue took place.