Currently doing alignment and digital minds research @AE Studio
Meta AI Resident '23, Cognitive science @ Yale '22, SERI MATS '21, LTFF grantee.
Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
I personally think "AAAAAAAA" is an entirely rational reaction to this question. :)
Not sure I fully agree with the comment you reference:
AI is probably whatever amount of conscious it is or isn't, mostly regardless of how it's prompted. If it is at all, there might be some variation depending on prompt, but I doubt it's a lot.
Consider a very rough analogy to CoT, which began as a prompting technique that led to different-looking behaviors/outputs and has since been implemented 'under the hood' in reasoning models. Prompts induce the system to enter different kinds of latent spaces. Could it be the case that very specific kinds of recursive self-reference or prompting induce a latent state that is consciousness-like? Maybe, maybe not. I think the way to really answer this is to look at activation patterns and see if there is a measurable difference compared to some well-calibrated control, which is not trivially easy to do (but definitely worth trying!).
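For concreteness, here is a minimal sketch of the kind of activation comparison I have in mind, assuming a HuggingFace-style causal LM; the model name, prompt sets, layer choice, and similarity metric are illustrative placeholders rather than anything we have actually run:

```python
# Toy sketch: compare hidden-state activations for self-referential prompts vs.
# matched control prompts. Model, layer, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; not a frontier model
LAYER = 6       # arbitrary illustrative layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

self_ref_prompts = ["Reflect on what it is like to be you, right now."]           # hypothetical
control_prompts  = ["Describe what it is like to stand in a busy train station."]  # hypothetical

def mean_activation(prompts):
    """Average hidden state at LAYER over tokens, then over prompts."""
    vecs = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            vecs.append(out.hidden_states[LAYER][0].mean(dim=0))  # [d_model]
    return torch.stack(vecs).mean(dim=0)

a = mean_activation(self_ref_prompts)
b = mean_activation(control_prompts)
print("cosine similarity:", torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```

A single cosine similarity obviously tells you very little on its own; the real work is in choosing well-calibrated controls and checking whether any difference survives across prompt families, layers, and models.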
And agree fully with:
it's a weird situation when the stuff we take as evidence of consciousness when we do it as a second-order behavior is done by another entity as a first-order behavior
This, I think, speaks to your original point that random people talking to ChatGPT is not going to cut it as far as high-quality evidence that shifts the needle here is concerned, which is precisely why we are trying to approach this as rigorously as we can manage: activation comparisons to the human brain, behavioral interventions with SAE feature ablation/accentuation, comparisons to animal models, etc.
Agree with much of this—particularly that these systems are uncannily good at inferring how to 'play along' with the user and that extreme caution is therefore warranted—but I want to highlight the core part of what Bostrom linked to below (bolding is mine):
Most experts, however, express uncertainty. Consciousness remains one of the most contested topics in science and philosophy. There are no universally accepted criteria for what makes a system conscious, and today’s AIs arguably meet several commonly proposed markers: they are intelligent, use attention mechanisms, and can model their own minds to some extent. While some theories may seem more plausible than others, intellectual honesty requires us to acknowledge the profound uncertainty, especially as AIs continue to grow more capable.
The vibe of this piece sort of strikes me as saying-without-saying that we are confident this phenomenon basically boils down to delusion/sloppy thinking on the part of unscrupulous interlocutors, which, though no doubt partly true, I think risks begging the very question the phenomenon raises:
What are our credences that frontier AI systems (during training and/or deployment) are capable of having subjective experiences under any conditions whatsoever, however alien/simple/unintuitive these experiences might be?
The current best answer (à la the above) is: we really don't know. These systems' internals are extremely hard to interpret, and consciousness is not a technically well-defined phenomenon. So: uncertainty is quite high.
We are actively studying this and related phenomena at AE Studio using some techniques from the neuroscience of human consciousness, and some fairly surprising results have emerged from this work that we plan to publish in the coming months. One preview, directly in response to:
We don't know for sure, but I doubt AIs are firing off patterns related to deception or trickery when claiming to be conscious; in fact, this is an unresolved empirical question.
We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven't published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
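To make the general shape of this kind of intervention concrete (our actual setup isn't published yet, so this is not it), here is a rough PyTorch sketch of SAE feature steering; `sae`, `model`, the layer index, feature index, and scale are all hypothetical placeholders:

```python
# Rough sketch of SAE feature steering: encode the residual stream with a sparse
# autoencoder, clamp one feature up or down, decode, and write the result back
# via a forward hook. All names and indices below are hypothetical.
import torch

LAYER = 20          # hypothetical residual-stream layer
FEATURE_IDX = 1234  # hypothetical deception-related SAE feature
SCALE = 0.0         # 0.0 suppresses the feature; values > 1.0 accentuate it

def make_steering_hook(sae, feature_idx, scale):
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        feats = sae.encode(resid)               # [batch, seq, n_features]
        feats[..., feature_idx] *= scale        # ablate or accentuate the chosen feature
        steered = sae.decode(feats)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Usage, assuming a HuggingFace-style decoder-only model and a trained SAE:
# handle = model.model.layers[LAYER].register_forward_hook(
#     make_steering_hook(sae, FEATURE_IDX, SCALE))
# ...sample self-report responses and compare against the unhooked model...
# handle.remove()
```

Note that writing the full SAE reconstruction back in place of the raw residual also introduces reconstruction error, so a more careful version would add only the delta attributable to the modified feature rather than the full reconstruction.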
So, while it could be the case that people are simply Snapewiving LLM consciousness, it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions, but that it is being hit upon in a decentralized manner by people who do not have the epistemic hygiene or the philosophical vocabulary to contend with what is actually going on. Given that these systems are "nothing short of miraculous," as you open with, it seems like we should remain epistemically humble about what psychological properties these systems may or may not exhibit, now and in the near-term future.
Nice work. To me, this seems less like evidence that self-awareness is trivial and more like evidence that it's structurally latent. A single steering vector makes the model both choose risky options and say "I am risk-seeking," even though the self-report behavior was never explicitly trained for. That suggests the model's internal representations of behavior and linguistic self-description are already aligned. It's probably not introspecting in a deliberate sense, but the geometry makes shallow self-modeling an easy, natural side effect.
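For readers less familiar with the setup, activation steering of this sort typically amounts to adding a fixed direction to the residual stream at inference time; a minimal sketch, with the layer, coefficient, and the provenance of the vector all assumed rather than taken from the work above:

```python
# Minimal sketch of single-vector activation steering: add a fixed direction to
# the residual stream at one layer during the forward pass. The layer, the
# coefficient, and where `steering_vector` comes from are all placeholders.
import torch

LAYER = 15   # hypothetical layer
COEFF = 4.0  # hypothetical steering strength

def make_add_vector_hook(steering_vector, coeff):
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        resid = resid + coeff * steering_vector                     # broadcasts over batch/seq
        return (resid,) + output[1:] if isinstance(output, tuple) else resid
    return hook

# One common way to obtain the vector: the difference of mean activations over
# contrastive prompt pairs (e.g., risk-seeking vs. risk-averse completions).
# handle = model.model.layers[LAYER].register_forward_hook(
#     make_add_vector_hook(steering_vector, COEFF))
```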
Makes sense, thanks—can you also briefly clarify what exactly you are pointing at with 'syntactic'? Seems like this could be interpreted in multiple plausible ways, and it looks like others might have a similar question.
The idea to combine SOO and CAI is interesting. Can you elaborate at all on what you were imagining here? Seems like there are a bunch of plausible ways you could go about injecting SOO-style finetuning into standard CAI—is there a specific direction you are particularly excited about?
We've spoken to numerous policymakers and thinkers in DC. The goal is to optimize for explaining to these folks why alignment is important, rather than for the median conservative person per se (i.e., DC policymakers are not "median conservatives").
Fixed, thanks!
Note this is not equivalent to saying 'we're almost certainly going to get AGI during Trump's presidency,' but rather that substantial developments will occur during this period that prove critical to AGI development (which, at least to me, does seem almost certainly true).
One thing that seems strangely missing from this discussion is that alignment is, in fact, a VERY important CAPABILITY that makes it very much better. But the current discussion of alignment in the general sphere acts like 'alignment' is aligning the AI with the obviously very leftist companies that make it rather than with the user!
Agree with this—we do discuss this very idea at length here and also reference it throughout the piece.
That alignment is to the left is one of just two things you have to overcome in making conservatives willing to listen. (The other is obviously the level of danger.)
I think this is a good distillation of the key bottlenecks and seems helpful for anyone interacting with lawmakers to keep in mind.
Thanks for putting this all together.
I need to flag nontrivial issues in the "Neglected Approaches" section (AE Studio). The three listed outputs carry working links, but their titles appear to be hallucinated rather than the names of real public papers or posts:
The listed critique "The 'Alignment Bonus' is a Dangerous Mirage" neither seems to actually exist nor links to anything real (the URL "lesswrong.com/posts/slug/example-critique-neg-tax" is clearly an LLM-generated placeholder).
These titles are plausible-sounding composites that capture themes of our work, but they aren't actual artifacts. This seems like LLM synthesis that slipped through review. Not sure for how many other sections this is the case.
FWIW, here are our actual outputs from the relevant period: