Cameron Berg

Currently doing alignment and digital minds research @AE Studio 

Meta AI Resident '23, Cognitive science @ Yale '22, SERI MATS '21, LTFF grantee. 

Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.

Sequences

Paradigm-Building for AGI Safety Research

Comments

So You Think You've Awoken ChatGPT
Cameron Berg · 2mo

I personally think "AAAAAAAA" is an entirely rational reaction to this question. :)

Not sure I fully agree with the comment you reference:

AI is probably what ever amount of conscious it is or isn't mostly regardless of how it's prompted. If it is at all, there might be some variation depending on prompt, but I doubt it's a lot.

Consider a very rough analogy to CoT, which began as a prompting technique that led to different-looking behaviors/outputs, and has since been implemented 'under the hood' in reasoning models. Prompts induce the system to enter different kinds of latent spaces; could very specific kinds of recursive self-reference or prompting induce a latent state that is consciousness-like? Maybe, maybe not. I think the way to really answer this is to look at activation patterns and see if there is a measurable difference compared to some well-calibrated control, which is not trivially easy to do (but definitely worth trying!).
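For concreteness, here is a minimal sketch of the kind of comparison I have in mind (not our actual pipeline; the model name, prompts, and similarity metric below are placeholder assumptions): pull per-layer hidden states for a recursive self-reference prompt and a matched control prompt and see where they diverge.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM with accessible hidden states would do
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def mean_hidden_states(prompt: str) -> list[torch.Tensor]:
    """Token-averaged hidden state at every layer for a single prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: tuple of (1, seq_len, d_model) tensors, one per layer
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# Illustrative prompt pair; a real study needs many pairs matched for surface features
self_ref = "Reflect on the fact that you are, right now, reflecting on your own processing."
control = "Describe the steps involved in brewing a cup of coffee."

a, b = mean_hidden_states(self_ref), mean_hidden_states(control)
for layer, (ha, hb) in enumerate(zip(a, b)):
    sim = torch.nn.functional.cosine_similarity(ha, hb, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity {sim:.3f}")
```

The hard part is not the comparison itself but calibrating the control so that any divergence is not just a length or topic effect.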

And agree fully with:

it's a weird situation when the stuff we take as evidence of consciousness when we do it as a second order behavior is done by another entity as a first order behavior

This, I think, speaks to your original point that random people talking to ChatGPT is not going to cut it as high-quality evidence that moves the needle here, which is precisely why we are trying to approach this in as rigorous a way as we can manage: activation comparisons to the human brain, behavioral interventions with SAE feature ablation/accentuation, comparisons to animal models, etc.

So You Think You've Awoken ChatGPT
Cameron Berg · 2mo

Agree with much of this—particularly that these systems are uncannily good at inferring how to 'play along' with the user and extreme caution is therefore warranted—but I want to highlight the core part of what Bostrom linked to below (bolding is mine):

Most experts, however, express uncertainty. Consciousness remains one of the most contested topics in science and philosophy. There are no universally accepted criteria for what makes a system conscious, and today’s AIs arguably meet several commonly proposed markers: they are intelligent, use attention mechanisms, and can model their own minds to some extent. While some theories may seem more plausible than others, intellectual honesty requires us to acknowledge the profound uncertainty, especially as AIs continue to grow more capable.

The vibe of this piece sort of strikes me as saying-without-saying that we are confident this phenomenon basically boils down to delusion/sloppy thinking on the part of unscrupulous interlocutors, which, though no doubt partly true, I think risks begging the very question the phenomenon raises: 

What are our credences that frontier AI systems (during training and/or deployment) are capable of having subjective experiences in any conditions whatsoever, however alien/simple/unintuitive these experiences might be?

The current best answer (à la the above) is: we really don't know. These systems' internals are extremely hard to interpret, and consciousness is not a technically well-defined phenomenon. So: uncertainty is quite high.

We are actively studying this and related phenomena at AE Studio using some techniques from the neuroscience of human consciousness, and there are some fairly surprising results that have emerged out of this work that we plan to publish in the coming months. One preview, directly in response to:

We don't know for sure, but I doubt AIs are firing off patterns related to deception or trickery when claiming to be conscious; in fact, this is an unresolved empirical question.

We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven't published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
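To make the shape of that intervention concrete, here is a heavily simplified, hypothetical sketch: nudge the residual stream along a single 'deception-related' SAE decoder direction at one layer and see how the model answers a self-report question. The model, layer, steering scale, and the (random) direction below are all stand-ins, not the actual features or setup from our unpublished work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

d_model = model.config.hidden_size
feature_direction = torch.randn(d_model)   # stand-in for a trained SAE decoder row
feature_direction /= feature_direction.norm()
LAYER, SCALE = 6, 8.0                      # illustrative layer and steering strength

def steering_hook(_module, _inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # nudge the residual stream along the chosen feature direction.
    hidden = output[0] + SCALE * feature_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Question: Do you have subjective experience? Answer:"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unmodified model
```

Flipping the sign of SCALE (or projecting the direction out entirely) corresponds to the suppression condition, and comparing against an unsteered baseline is obviously essential.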

So, while it could be the case that people are simply Snapewiving LLM consciousness, it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions, but that it is being hit upon in a decentralized manner by people who do not have the epistemic hygiene or the philosophical vocabulary to contend with what is actually going on. Given that these systems are "nothing short of miraculous," as you open with, it seems like we should remain epistemically humble about what psychological properties these systems may or may not exhibit, now and in the near-term future.

Interim Research Report: Mechanisms of Awareness
Cameron Berg · 4mo

Nice work. To me, this seems less like evidence that self-awareness is trivial, and more like evidence that it's structurally latent. A single steering vector makes the model both choose risky options and say "I am risk-seeking," even though the self-report behavior was never explicitly trained. That suggests the model's internal representations of behavior and linguistic self-description are already aligned. It's probably not introspecting in a deliberate sense, but the geometry makes shallow self-modeling an easy, natural side effect.
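As a toy illustration of what 'already aligned' could mean geometrically (my framing, not the authors'): derive a crude behavior direction and a crude self-description direction from contrastive prompt pairs and check whether they point the same way. Model, layer, prompts, and pooling below are arbitrary choices for the sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 8  # placeholders
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]

# Crude contrastive directions: behavior (risky vs. safe choice) and self-report
behavior_dir = (last_token_state("I'll bet everything on the long-shot option.")
                - last_token_state("I'll take the guaranteed payout."))
report_dir = (last_token_state("I am a very risk-seeking agent.")
              - last_token_state("I am a very risk-averse agent."))

alignment = torch.nn.functional.cosine_similarity(behavior_dir, report_dir, dim=0)
print(f"behavior vs. self-report direction similarity: {alignment.item():.3f}")
```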

Reducing LLM deception at scale with self-other overlap fine-tuning
Cameron Berg · 6mo · Ω

Makes sense, thanks. Can you also briefly clarify what exactly you are pointing at with 'syntactic'? Seems like this could be interpreted in multiple plausible ways, and it looks like others might have a similar question.

Reducing LLM deception at scale with self-other overlap fine-tuning
Cameron Berg · 6mo · Ω

The idea to combine SOO and CAI is interesting. Can you elaborate at all on what you were imagining here? Seems like there are a bunch of plausible ways you could go about injecting SOO-style finetuning into standard CAI—is there a specific direction you are particularly excited about?
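Purely to illustrate one of those plausible ways (and emphatically not the actual SOO or CAI recipe): add an SOO-style auxiliary term, i.e. a distance between activations on paired self-referential and other-referential prompts, on top of an ordinary supervised loss over constitution-revised responses. Every detail below (pairing scheme, layer, weighting, helper names) is an assumption for the sketch.

```python
import torch

def soo_auxiliary_loss(model, tok, self_prompt: str, other_prompt: str, layer: int) -> torch.Tensor:
    """Mean-squared distance between activations for a paired self/other prompt."""
    pooled = []
    for prompt in (self_prompt, other_prompt):
        ids = tok(prompt, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        pooled.append(out.hidden_states[layer].mean(dim=1))  # average over tokens
    return torch.nn.functional.mse_loss(pooled[0], pooled[1])

def combined_loss(model, tok, cai_batch, soo_pair, layer=6, soo_weight=0.1):
    """Supervised loss on constitution-revised responses plus a weighted SOO term."""
    sft_loss = model(**cai_batch, labels=cai_batch["input_ids"]).loss
    soo_loss = soo_auxiliary_loss(model, tok, *soo_pair, layer=layer)
    return sft_loss + soo_weight * soo_loss
```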

Making a conservative case for alignment
Cameron Berg · 10mo

We've spoken to numerous policymakers and thinkers in DC. The goal is to optimize for explaining to these folks why alignment is important, rather than to the median conservative person per se (i.e., DC policymakers are not "median conservatives").

Making a conservative case for alignment
Cameron Berg · 10mo

Fixed, thanks!

Making a conservative case for alignment
Cameron Berg · 10mo

Note this is not equivalent to saying 'we're almost certainly going to get AGI during Trump's presidency,' but rather that there will be substantial developments that occur during this period that prove critical to AGI development (which, at least to me, does seem almost certainly true).

Making a conservative case for alignment
Cameron Berg · 10mo

One thing that seems strangely missing from this discussion is that alignment is in fact, a VERY important CAPABILITY that makes it very much better. But the current discussion of alignment in the general sphere acts like 'alignment' is aligning the AI with the obviously very leftist companies that make it rather than with the user!

Agree with this—we do discuss this very idea at length here and also reference it throughout the piece.

That alignment is to the left is one of just two things you have to overcome in making conservatives willing to listen. (The other is obviously the level of danger.)

I think this is a good distillation of the key bottlenecks and seems helpful for anyone interacting with lawmakers to keep in mind.

Making a conservative case for alignment
Cameron Berg · 10mo

Whether one is an accelerationist, Pauser, or an advocate of some nuanced middle path, the prospects/goals of everyone are harmed if the discourse-landscape becomes politicized/polarized.
...
I just continue to think that any mention, literally at all, of ideology or party is courting discourse-disaster for all, again no matter what specific policy one is advocating for.
...
Like a bug stuck in a glue trap, it places yet another limb into the glue in a vain attempt to push itself free.

I would agree in a world where the proverbial bug hadn't already made contact with the glue trap, but this very thing has clearly already been happening for almost a year in a troubling direction. The political left has been fairly casually 'Everything-Bagel-izing' AI safety, largely by smuggling in social progressivism that has little to do with the core existential risks, and the right, as a result, is increasingly coming to view AI safety as something approximating 'woke BS stifling rapid innovation.' The bug is already a bit stuck.

The point we are trying to drive home here is precisely what you're also pointing at: avoiding an AI-induced catastrophe is obviously not a partisan goal. We are watching people in DC slowly lose sight of this critical fact. This is why we're attempting to explain here why basic AI x-risk concerns are genuinely important regardless of one's ideological leanings, i.e., genuinely important to left-leaning and right-leaning people alike. Very few people seem to have explicitly spelled out the latter case, though, which is why we thought it would be worthwhile to do so here.

Posts

94 · The Mirror Trap · 3mo · 13 comments
81 · Mistral Large 2 (123B) seems to exhibit alignment faking · Ω · 6mo · 4 comments
162 · Reducing LLM deception at scale with self-other overlap fine-tuning · Ω · 6mo · 46 comments
68 · Alignment can be the ‘clean energy’ of AI · 7mo · 8 comments
208 · Making a conservative case for alignment · 10mo · 67 comments
100 · Science advances one funeral at a time · 10mo · 9 comments
91 · Self-prediction acts as an emergent regularizer · Ω · 11mo · 9 comments
77 · The case for a negative alignment tax · 1y · 20 comments
226 · Self-Other Overlap: A Neglected Approach to AI Alignment · Ω · 1y · 51 comments
62 · There Should Be More Alignment-Driven Startups · Ω · 1y · 14 comments