This is a special post for quick takes by Bogdan Ionut Cirstea. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Bogdan Ionut Cirstea's Shortform
234 comments

'🚨 The annual report of the US-China Economic and Security Review Commission is now live. 🚨

Its top recommendation is for Congress and the DoD to fund a Manhattan Project-like program to race to AGI.

Buckle up...' 

https://x.com/hamandcheese/status/1858897287268725080

In the reuters article they highlight Jacob Helberg: https://www.reuters.com/technology/artificial-intelligence/us-government-commission-pushes-manhattan-project-style-ai-initiative-2024-11-19/

He seems quite influential in this initiative and recently also wrote this post:

https://republic-journal.com/journal/11-elements-of-american-ai-supremacy/

Wikipedia has the following paragraph on Helberg:

“He grew up in a Jewish family in Europe.[9] Helberg is openly gay.[10] He married American investor Keith Rabois in a 2018 ceremony officiated by Sam Altman.”

Might this be an angle to understand the influence that Sam Altman has on recent developments in the US government?

5Sodium
This chapter on AI follows immediately after the year in review. I went and checked the previous few years' annual reports to see what the comparable chapters were about:
2023: China's Efforts To Subvert Norms and Exploit Open Societies
2022: CCP Decision-Making and Xi Jinping's Centralization Of Authority
2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party's Ambitions and Challenges at its Centennial)
2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China's View of Strategic Competition With the United States)
And this year it's Technology And Consumer Product Opportunities and Risks (Chapter 3: U.S.-China Competition in Emerging Technologies). Reminds me of when Richard Ngo said something along the lines of "We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies"
7Akash
I think we're seeing more interest in AI, but I think interest in "AI in general" and "AI through the lens of great power competition with China" has vastly outpaced interest in "AI safety". (Especially if we're using a narrow definition of AI safety; note that people in DC often use the term "AI safety" to refer to a much broader set of concerns than AGI safety/misalignment concerns.) I do think there's some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there's still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.
4Bogdan Ionut Cirstea
'The report doesn't go into specifics but the idea seems to be to build / commandeer the computing resources to scale to AGI, which could include compelling the private labs to contribute talent and techniques. DX rating is the highest priority DoD procurement standard. It lets DoD compel companies, set their own price, skip the line, and do basically anything else they need to acquire the good in question.' https://x.com/hamandcheese/status/1858902373969564047 
4Bogdan Ionut Cirstea
(screenshot in post from PDF page 39 of https://www.uscc.gov/sites/default/files/2024-11/2024_Annual_Report_to_Congress.pdf)
2Bogdan Ionut Cirstea
'China hawk and influential Trump AI advisor Jacob Helberg asserted to Reuters that “China is racing towards AGI," but I couldn't find any evidence in the report to support that claim.' https://x.com/GarrisonLovely/status/1859022323799699474 

I suspect current approaches probably significantly or even drastically under-elicit automated ML research capabilities.

I'd guess the average cost of producing a decent ML paper is at least $10k (in the West, at least) and probably closer to the $100k's.

In contrast, Sakana's AI scientist cost on average $15/paper and $0.50/review. PaperQA2, which claims superhuman performance at some scientific Q&A and lit review tasks, costs something like $4/query. Other papers with claims of human-range performance on ideation or reviewing also probably have costs of <$10/idea or review.

Even the auto ML R&D benchmarks from METR or UK AISI don't give me at all the vibes of coming anywhere close to e.g. what a 100-person team at OpenAI could accomplish in 1 year, if they tried really hard to automate ML.

A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~$10k in inference costs productively. I suspect the resulting agent would probably not do much better than with $100 of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even in just 3 years' time, give... (read more)
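To make "a scaffold which could use ~$10k in inference costs productively" slightly more concrete, here is a hedged sketch of the simplest such shape - best-of-n sampling against an automatic verifier under an explicit dollar budget. The helper names, prices, and structure are hypothetical placeholders, not a claim about how any existing scaffold works:

```python
def best_of_n_under_budget(task, generate, verify, budget_usd: float,
                           cost_per_sample_usd: float = 1.0):
    """Keep sampling candidate solutions until the budget runs out or the
    verifier is fully satisfied; return the best candidate seen.
    `generate` and `verify` are placeholder callables (e.g. an LLM call
    and a unit-test suite returning a score in [0, 1])."""
    best, best_score = None, float("-inf")
    spent = 0.0
    while spent + cost_per_sample_usd <= budget_usd:
        candidate = generate(task)
        spent += cost_per_sample_usd
        score = verify(task, candidate)   # e.g. fraction of tests passed
        if score > best_score:
            best, best_score = candidate, score
        if score >= 1.0:                  # verifier fully satisfied, stop early
            break
    return best, best_score, spent
```

Even this trivial shape makes the point that a $10k budget buys orders of magnitude more verified attempts than a $100 one; the open question is how much of that translates into better research outputs rather than redundant samples.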


In contrast, Sakana's AI scientist cost on average $15/paper and $0.50/review. 

The Sakana AI stuff is basically total bogus, as I've pointed out on like 4 other threads (and also as Scott Alexander recently pointed out). It does not produce anything close to fully formed scientific papers. Its output is really not better than just prompting o1 yourself. Of course, o1 and even Sonnet and GPT-4 are very impressive, but there is no update to be made after you've played around with that. 

I agree that ML capabilities are under-elicited, but the Sakana AI stuff really is very little evidence on that, besides someone being good at marketing and setting up some scaffolding that produces fake prestige signals.

5Bogdan Ionut Cirstea
(Again) I think this is missing the point that we've now (for the first time, to my knowledge) observed an early demo of the full research workflow being automatable, as flawed as the outputs might be.

I completely agree, and we should just obviously build an organization around this. Automating alignment research while also getting a better grasp on maximum current capabilities (and a better picture of how we expect it to grow).

(This is my intention, and I have had conversations with Bogdan about this, but I figured I'd make it more public in case anyone has funding or ideas they would like to share.)

6Bogdan Ionut Cirstea
Figures 3 and 4 from MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering seem like some amount of evidence for this view:
4Bogdan Ionut Cirstea
Also, there are notable researchers and companies working on developing 'a truly general way of scaling inference compute' right now, and I think it would be prudent to consider what happens if they succeed. (This also has implications for automating AI safety research).
2Bogdan Ionut Cirstea
To spell it out more explicitly, the current way of scaling inference (CoT) seems pretty good vs. some of the most worrying threat models, which often depend on opaque model internals.

Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920 and about how gradient descent is supposed to be nothing like that https://twitter.com/ESYudkowsky/status/1660623900789862401. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent leading to surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.

12/10/24 update: more, and in my view even somewhat methodologically s... (read more)
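For a sense of what the cited evidence concretely looks like: these papers typically fit a simple (often linear) readout from frozen model embeddings to human arousal/valence/aesthetics ratings (or to brain responses), and compare predictivity against the human-to-human noise ceiling. A minimal sketch of that analysis shape, with random arrays standing in for the real stimuli, features, and ratings:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Hypothetical data: embeddings of N images from a frozen vision model,
# and mean human aesthetic ratings for the same images (stand-ins only;
# the cited papers use real stimuli, real raters, and fMRI responses).
rng = np.random.default_rng(0)
N, D = 500, 768
embeddings = rng.normal(size=(N, D))   # stand-in for frozen model features
ratings = rng.normal(size=N)           # stand-in for human ratings

# Cross-validated linear readout; the papers' claim is that with real data
# this predictivity approaches the human-to-human noise ceiling.
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
r2 = cross_val_score(model, embeddings, ratings, cv=5, scoring="r2")
print("cross-validated R^2:", r2.mean())
```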

5AprilSR
hmm. i think you're missing eliezer's point. the idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.
3Bogdan Ionut Cirstea
But my point isn't just that the AI is able to produce similar ratings to humans' for aesthetics, etc., but that it also seems to do so through at least partially overlapping computational mechanisms to humans', as the comparisons to fMRI data suggest.
3AprilSR
I don't think having a beauty-detector that works the same way humans' beauty-detectors do implies that you care about beauty?
4Bogdan Ionut Cirstea
Agree that it doesn't imply caring. But I think, given accumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
2RHollerith
Even if Eliezer's argument in that Twitter thread is completely worthless, it remains the case that "merely hoping" that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as "merely hoping" the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you? Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he considers this Twitter thread to be relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?

Contra both the 'doomers' and the 'optimists' on (not) pausing. Rephrased: RSPs (done right) seem right.

Contra 'doomers'. Oversimplified, 'doomers' (e.g. PauseAI, FLI's letter, Eliezer) ask(ed) for pausing now / even earlier (e.g. the Pause Letter). I expect this would be / have been very much suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests timing the pause so that we can use automated AI safety research, which (under the assumptions in that post) could result in '[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' We clearly don't have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities, would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they're still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.

Contra 'optimists'. I'm more unsure here, but the vibe I'm getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly something like 'no paus... (read more)

5habryka
At least Eliezer has been extremely clear that he is in favor of a stop not a pause (indeed, that was like the headline of his article "Pausing AI Developments Isn't Enough. We Need to Shut it All Down"), so I am confused why you list him with anything related to "pause". My guess is me and Eliezer are both in favor of a pause, but mostly because a pause seems like it would slow down AGI progress, not because the next 6 months in particular will be the most risky period. 
5JBlack
The relevant criterion is not whether the current models are likely to be x-risky (it's obviously far too late if they are!), but whether the next generation of models have more than an insignificant chance of being x-risky together with all the future frameworks they're likely to be embedded into. Given that the next generations are planned to involve at least one order of magnitude more computing power in training (and are already in progress!) and that returns on scaling don't seem to be slowing, I think the total chance of x-risk from those is not insignificant.
2Nathan Helm-Burger
I agree with some points here Bogdan, but not all of them. I do think that current models are civilization-scale-catastrophe-risky (but importantly not x-risky!) from a misuse perspective, but not yet from a self-directed perspective. Which means neither Alignment nor Control are currently civilization-scale-catastrophe-risky, much less x-risky. I also agree that pausing now would be counter-productive. My reasoning for this is that I agree with Samo Burja about some key points which are relevant here (while disagreeing with his conclusions due to other points). To quote myself:

Think about how you'd expect these factors to change if large AI training runs were paused. I think you might agree that this would likely result in a temporary shift of much of the top AI scientist talent to making theoretical progress. They'd want to be ready to come in strong after the pause was ended, with lots of new advances tested at small scale. I think this would actually result in more high-quality scientific thought directed at the heart of the problem of AGI, and thus make AGI very likely to be achieved sooner after the pause ends than it otherwise would have been.

I would go even farther, and make the claim that AGI could arise during a pause on large training runs. I think that the human brain is not a supercomputer; my upper estimate for 'human brain inference' is about at the level of a single 8x A100 server. Less than an 8x H100 server. Also, I have evidence from analysis of the long-range human connectome (long-range axons are called tracts, so perhaps I should call this a 'tractome'). [Hah, I just googled this term I came up with just now, and found it's already in use, and that it brings up some very interesting neuroscience papers. Cool.] Anyway... I was saying, this evidence shows that the range of bandwidth (data throughput in bits per second) between two cortical regions in the human brain is typically around 5 mb/s, and maxes out at about 50 mb/s. In other words,

Hot take, though increasingly moving towards lukewarm: if you want to get a pause/international coordination on powerful AI (which would probably be net good, though likely it would strongly depend on implementation details), arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and 'truer' than arguments about technical intent misalignment and loss of control (especially for not-wildly-superhuman AI).

arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and 'truer'

Say more? 

I think the general impression of people on LW is that multipolar scenarios and concerns over "which monkey finds the radioactive banana and drags it home" are in large part a driver of AI racing instead of being a potential impediment/solution to it. Individuals, companies, and nation-states justifiably believe that whichever one of them accesses potentially superhuman AGI first will have the capacity to flip the gameboard at-will, obtain power over the entire rest of the Earth, and destabilize the currently-existing system. Standard game theory explains the final inferential step for how this leads to full-on racing (see the recent U.S.-China Commission's report for a representative example of how this plays out in practice).

I get that we'd like to all recognize this problem and coordinate globally on finding solutions, by "mak[ing] coordinated steps away from Nash equilibria in lockstep". But I would first need to see an example, a prototype, of how this can play out in practice on an important and highly salient issue.... (read more)
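To spell out the "standard game theory explains the final inferential step" part, here is a toy sketch with made-up payoffs (purely illustrative; not taken from the Commission's report or any cited source) in which racing is each actor's best response no matter what the other does:

```python
# Toy two-actor race game with made-up payoffs, purely to illustrate why
# racing can be the unique equilibrium even when coordinated restraint is
# jointly better. Entries are (actor A payoff, actor B payoff).
payoffs = {
    ("pause", "pause"): (3, 3),   # coordinated restraint
    ("pause", "race"):  (0, 4),   # the one who races gets a decisive advantage
    ("race",  "pause"): (4, 0),
    ("race",  "race"):  (1, 1),   # mutual racing: worse than coordination
}

def best_response(options, their_choice, me):
    """Return `me`'s payoff-maximizing choice given the other actor's choice."""
    def my_payoff(mine):
        key = (mine, their_choice) if me == 0 else (their_choice, mine)
        return payoffs[key][me]
    return max(options, key=my_payoff)

options = ("pause", "race")
for their_choice in options:
    print(f"vs {their_choice!r}: best responses = "
          f"{best_response(options, their_choice, 0)!r}, "
          f"{best_response(options, their_choice, 1)!r}")
# Racing is a best response to either choice, so (race, race) is the unique
# Nash equilibrium despite (pause, pause) being better for both actors.
```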

4Bogdan Ionut Cirstea
At the risk of being overly spicy/unnuanced/uncharitable: I think quite a few MIRI [agent foundations] memes ("which monkey finds the radioactive banana and drags it home", "automating safety is like having the AI do your homework", etc.) seem very lazy/un-truth-tracking and probably net-negative at this point, and I kind of wish they'd just stop propagating them (Eliezer being probably the main culprit here). Perhaps even more spicily, I similarly think that the old MIRI threat model of Consequentialism is looking increasingly 'tired'/un-truth-tracking, and there should be more updating away from it (and more so with every single increase in capabilities without 'proportional' increases in 'Consequentialism'/egregious misalignment).

(Especially) In a world where the first AGIs are not egregiously misaligned, it very likely matters enormously who builds the first AGIs and what they decide to do with them. While this probably creates incentives towards racing in some actors (probably especially the ones with the best chances to lead the race), I suspect better informing more actors (especially more of the non-leading ones, who might especially see themselves as more on the losing side in the case of AGI and potential destabilization) should also create incentives for (attempts at) more caution and coordination, which the leading actors might at least somewhat take into consideration, e.g. for reasons along the lines of https://aiprospects.substack.com/p/paretotopian-goal-alignment.

I'm not particularly optimistic about coordination, especially the more ambitious kinds of plans (e.g. 'shut it all down', long pauses like in 'A narrow path...', etc.), and that's to a large degree (combined with short timelines and personal fit) why I'm focused on automated safety research. I'm just saying: 'if you feel like coordination is the best plan you can come up with/you're most optimistic about, there are probably more legible and likely also more truth-tracking arguments
2Bogdan Ionut Cirstea
(Also, what Thane Ruthenis commented below.)
2Noosphere89
I'd say the big factor that makes AI controllable right now is that the compute necessary to build AI that can do very good AI research to automate R&D (and then the economy) is locked behind TSMC, Nvidia, and ASML, and their processes are both nearly irreplaceable and very expensive to make, so it's way easier to intervene on the chokepoints required for AI development than on gain-of-function research.
1sunwillrise
I agree, but I think this is slightly beside the original points I wanted to make.
8Thane Ruthenis
Agreed. I think a type of "stop AGI research" argument that's under-deployed is that there's no process or actor in the world that society would trust with unilateral godlike power. At large, people don't trust their own governments, don't trust foreign governments, don't trust international organizations, and don't trust corporations or their CEOs. Therefore, preventing anyone from building ASI anywhere is the only thing we can all agree on. I expect this would be much more effective messaging with some demographics, compared to even very down-to-earth arguments about loss of control. For one, it doesn't need to dismiss the very legitimate fear that the AGI would be aligned to values that a given person would consider monstrous. (Unlike "stop thinking about it, we can't align it to any values!".) And it is, of course, true.
4Akash
What kinds of conflicts are you envisioning? I think if the argument is something along the lines of "maybe at some point other countries will demand that the US stop AI progress", then from the perspective of the USG, I think it's sensible to operate under the perspective of "OK so we need to advance AI progress as much as possible and try to hide some of it, and if at some future time other countries are threatening us we need to figure out how to respond." But I don't think it justifies anything like "we should pause or start initiating international agreements." (Separately, whether or not it's "truer" depends a lot on one's models of AGI development. Most notably: (a) how likely is misalignment and (b) how slow will takeoff be//will it be very obvious to other nations that super advanced AI is about to be developed, and (c) how will governments and bureaucracies react and will they be able to react quickly enough.) (Also separately– I do think more people should be thinking about how these international dynamics might play out & if there's anything we can be doing to prepare for them. I just don't think they naturally lead to a "oh, so we should be internationally coordinating" mentality and instead lead to much more of a "we can do whatever we want unless/until other countries get mad at us & we should probably do things more secretly" mentality.)
4Bogdan Ionut Cirstea
I'm envisioning something like: scary powerful capabilities/demos/accidents leading to various/a coalition of other countries asking the US (and/or China) not to build any additional/larger data centers (and/or run any larger training runs), and, if they're scared enough, potentially even threatening various (escalatory) measures, including economic sanctions, blockading the supply of compute/prerequisites to compute, sabotage, direct military strikes on the data centers, etc. I'm far from an expert on the topic, but I suspect it might not be trivial to hide at least building a lot more new data centers/supplying a lot more compute, if a significant chunk of the rest of the world was watching very intently. I'm envisioning a very near-casted scenario, on very short (e.g. Daniel Kokotajlo-cluster) timelines, egregious misalignment quite unlikely but not impossible, slow-ish (couple of years) takeoff (by default, if no deliberate pause), pretty multipolar, but with more-obviously-close-to-scary capabilities, like ML R&D automation evals starting to fall.
4Akash
Thanks for spelling it out. I agree that more people should think about these scenarios. I could see something like this triggering central international coordination (or conflict). (I still don't think this would trigger the USG to take different actions in the near-term, except perhaps "try to be more secret about AGI development" and maybe "commission someone to do some sort of study or analysis on how we would handle these kinds of dynamics & what sorts of international proposals would advance US interests while preventing major conflict." The second thing is a bit optimistic but maybe plausible.)

Quick take on o1: overall, it's been a pretty good day. Likely still sub-ASL-3, (opaque) scheming still seems very unlikely because the prerequisites still don't seem there. CoT-style inference compute playing a prominent role in the capability gains is pretty good for safety, because differentially transparent. Gains on math and code suggest these models are getting closer to being usable for automated safety research (also for automated capabilities research, unfortunately).

6Vladimir_Nesov
CoT inference looks more like the training surface, not essential part of resulting cognition after we take one more step following such models. Orion is reportedly (being) pretrained on these reasoning traces, and if it's on the order of 50 trillion tokens, that's about as much as there is natural text data of tolerable quality in the world available for training. Contrary to the phrasing, what transformers predict is in part distant future tokens within a context, not proximate "next tokens" that follow immediately after whatever the prediction must be based on. So training on reasoning traces should teach the models concepts that let them arrive at the answer faster, skipping the avoidable parts of the traces and compressing a lot of the rest into less scrutable activations. The models trained at the next level of scale might be quite good at that, to the extent not yet known from experience with the merely GPT-4 scale models.
5Noosphere89
Some bad news is that there was some problematic power seeking and instrumental convergence, though thankfully that only happened in an earlier model: https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1#JGwizteTkCrYB5pPb Edit: It looks like the instrumentally convergent reasoning was because of the prompt, so I roll back my updates on instrumental convergence being likely: https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9

QwQ-32B-Preview was released open-weights, seems comparable to o1-preview. Unless they're gaming the benchmarks, I find it both pretty impressive and quite shocking that a 32B model can achieve this level of performance. Seems like great news vs. opaque (e.g. in-one-forward-pass) reasoning. Less good with respect to proliferation (there don't seem to be any [deep] algorithmic secrets), misuse, and short timelines.

6Vladimir_Nesov
From proliferation perspective, it reduces overhang, makes it more likely that Llama 4 gets long reasoning trace post-training in-house rather than later, and so initial capability evaluations give more relevant results. But if Llama 4 is already training, there might not be enough time for the technique to mature, and Llamas have been quite conservative in their techniques so far.
1mrtreasure
There have been comments from OAI staff that o1 is "GPT-2 level" so I wonder if it's a similar size?
9ShardPhoenix
I think they meant that as an analogy to how developed/sophisticated it was (ie they're saying that it's still early days for reasoning models and to expect rapid improvement), not that the underlying model size is similar.

IIRC OAers also said somewhere (doesn't seem to be in the blog post, so maybe this was on Twitter?) that o1 or o1-preview was initialized from a GPT-4 (a GPT-4o?), so that would also rule out a literal parameter-size interpretation (unless OA has really brewed up some small models).

1Lee_0505
There was an article about it before the release. https://archive.is/IwKSP
3gwern
(Relevant, although "involving its GPT-4 AI model" is a considerably weaker statement than 'initialized from a GPT-4 checkpoint'.)

Like transformers, SSMs like Mamba also have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:

'We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.'
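To make "weak single forward passes" concrete: the canonical hard state-tracking problem in this line of work is composing a sequence of permutations (the S5 word problem), which is trivial to solve sequentially but out of reach for a single constant-depth, parallelizable forward pass. A minimal toy generator for the task (my own sketch, not the papers' code):

```python
import random
from itertools import permutations

S5 = list(permutations(range(5)))  # the 120 elements of the symmetric group S5

def compose(p, q):
    """Apply permutation q after p (both as tuples mapping index -> value)."""
    return tuple(q[p[i]] for i in range(5))

def sample_word_problem(length: int):
    """Return a sequence of permutations (the input) and the running
    composition after each step (the state a model must track)."""
    word = [random.choice(S5) for _ in range(length)]
    state = tuple(range(5))  # identity permutation
    states = []
    for p in word:
        state = compose(state, p)
        states.append(state)
    return word, states

word, states = sample_word_problem(16)
# The final group element: easy to compute sequentially, provably hard to
# compute in one constant-depth parallel pass (it is NC^1-hard).
print(states[-1])
```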

7ryan_greenblatt
Surely fundamentally at odds? You can't spend a while thinking without spending a while thinking. Of course, the lunch still might be very cheap by only spending a while thinking a fraction of the time or whatever.

'Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.' https://epochai.org/blog/data-movement-bottlenecks-scaling-past-1e28-flop 

The post argues that there is a latency limit at 2e31 FLOP, and I've found it useful to put this scale into perspective. 

Current public models such as Llama 3 405B are estimated to be trained with ~4e25 FLOP, so such a model would require 500,000x more compute. Since Llama 3 405B was trained with 16,000 H-100 GPUs, the model would require 8 billion H-100 GPU equivalents, at a cost of $320 trillion with H-100 pricing (or ~$100 trillion if we use B-200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled out by another factor: the 2e31 limit assumes a training time of only 3 months. If we were to build such a system over several years and had the patience to wait an additional 3 years for the training run to complete, this pushes the latency limit out by another order of magnitude. So at the point where we are bound by the latency limit, we are either investing a significant percentage of world GDP into the project, or we have already reached ASI at a smaller scale of compute and are using it to dramatically reduce compute costs for successor models. 
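A back-of-the-envelope version of that arithmetic (the ~$40k-per-H100 price is my assumption, chosen to roughly reproduce the quoted $320 trillion figure; the other inputs are from the comment above):

```python
# Rough check of the scale implied by the 2e31 FLOP latency wall.
# Assumptions (mine, not from the Epoch post): ~$40k per H100, and GPU count
# scaling linearly with training compute at fixed (3-month) training time.

latency_wall_flop = 2e31          # latency limit from the Epoch post
llama3_405b_flop = 4e25           # estimated Llama 3 405B training compute
llama3_405b_gpus = 16_000         # H100s used for Llama 3 405B
h100_price_usd = 40_000           # assumed price per H100

compute_ratio = latency_wall_flop / llama3_405b_flop        # ~5e5
gpu_equivalents = llama3_405b_gpus * compute_ratio          # ~8e9 H100s
hardware_cost = gpu_equivalents * h100_price_usd            # ~$3.2e14

print(f"compute ratio:    {compute_ratio:.1e}x")            # ~5.0e+05x
print(f"H100 equivalents: {gpu_equivalents:.1e}")           # ~8.0e+09
print(f"hardware cost:    ${hardware_cost:.1e}")            # ~$3.2e+14 (~$320T)
```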

Of course none of this analysis applies to the earlier data movement limit of 2e28 FLOP, which I think is more relevant and interesting. 

6anaguma
An important caveat to the data movement limit: “A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.” I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data wall is at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW power required) before we hit the data movement limit.
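For illustration (plugging in a hypothetical dataset size; the number is mine, not Epoch's or Zhang et al.'s), the quoted batch-size fit grows quickly with data:

$$
B = 17.75\,D^{0.47} \quad\Rightarrow\quad B\big|_{D = 2\times 10^{13}\ \text{tokens}} \approx 17.75 \times (2\times 10^{13})^{0.47} \approx 3\times 10^{7}\ \text{tokens},
$$

i.e. a critical batch size on the order of tens of millions of tokens at that dataset scale, which is what pushes the utilization fall out by the quoted two orders of magnitude.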
3Bogdan Ionut Cirstea
Speculatively, this might also differentially incentivize (research on generalized) inference scaling, with various potential strategic implications, including for AI safety (current inference scaling methods tend to be tied to CoT and the like, which are quite transparent) and for regulatory frameworks/proliferation of dangerous capabilities.
6rotatingpaguro
Aschenbrenner in Situational Awareness predicts illegible chains of thought are going to prevail because they are more efficient. I know of one developer claiming to do this (https://platonicresearch.com/) but I guess there must be many.
3Bogdan Ionut Cirstea
under the assumptions here (including Chinchilla scaling laws), depth wouldn't increase by more than about 3x before the utilization rate starts dropping (because depth would increase with exponent about 1/6 of the total increase in FLOP); which seems like great news for the legibility of CoT outputs and similar and vs. opaque reasoning in models: https://lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
1anaguma
Interesting paper, though the estimates here don’t seem to account for Epoch’s correction to the chinchilla scaling laws: https://epochai.org/blog/chinchilla-scaling-a-replication-attempt This would imply that the data movement bottleneck is a bit further out.

Success at currently-researched generalized inference scaling laws might risk jeopardizing some of the fundamental assumptions of current regulatory frameworks.

  • the o1 results have illustrated specialized inference scaling laws for model capabilities in some specialized domains (e.g. math); notably, these don't seem to hold generally across all domains - e.g. o1 doesn't seem better than GPT-4o at writing;
  • there's ongoing work at OpenAI to make generalized inference scaling work;
  • e.g. this could perhaps (though maybe somewhat overambitiously) be framed, in the language of https://epochai.org/blog/trading-off-compute-in-training-and-inference, as there no longer being an upper bound on how many OOMs of inference compute can be traded for equivalent OOMs of pretraining compute;
  • to the best of my awareness, current regulatory frameworks/proposals (e.g. the EU AI Act, the Executive Order, SB 1047) frame the capabilities of models in terms of (pre)training compute and maybe fine-tuning compute (e.g. if (pre)training FLOP > 1e26, the developer needs to take various measures), without any similar requirements framed in terms of inference compute; so current regulatory frameworks seem unprepared f
... (read more)
2Bogdan Ionut Cirstea
For similar reasons to the discussion here about why individuals and small businesses might be expected to be able to (differentially) contribute to LM scaffolding (research), I expect them to be able to differentially contribute to [generalized] inference scaling [research]; plausibly also the case for automated ML research agents. Also relevant: Before smart AI, there will be many mediocre or specialized AIs. 

Quick take: on the margin, a lot more research should probably be going into trying to produce benchmarks/datasets/evaluation methods for safety (than more directly doing object-level safety research). 

Some past examples I find valuable - in the case of unlearning: WMDP, Eight Methods to Evaluate Robust Unlearning in LLMs; in the case of mech interp - various proxies for SAE performance, e.g. from Scaling and evaluating sparse autoencoders, as well as various benchmarks, e.g. FIND: A Function Description Benchmark for Evaluating Interpretability Methods. Prizes and RFPs seem like a potentially scalable way to do this - e.g. https://www.mlsafety.org/safebench - and I think they could be particularly useful on short timelines.

Including differentially vs. doing the full stack of AI safety work - because I expect a lot of such work could be done by automated safety researchers soon, with the right benchmarks/datasets/evaluation methods. This could also make it much easier to evaluate the work of automated safety researchers and to potentially reduce the need for human feedback, which could be a significant bottleneck.

Better proxies could also make it easier to productively deploy ... (read more)

4Raemon
I'm torn between generally being really fucking into improving feedback loops (and thinking they are a good way to make it easier to make progress on confusing questions), and, being sad that so few people are actually just trying to actually directly tackle the hard bits of the alignment challenge.
4Bogdan Ionut Cirstea
Some quick thoughts:
* automating research that 'tackles the hard bits' seems likely to be harder (happen chronologically later) and might be more bottlenecked by good (e.g. agent foundations) human researcher feedback - which does suggest it might be valuable to recruit more agent foundations researchers today
* but my impression is that recruiting/forming agent foundations researchers itself is much harder and has much worse feedback loops
* I expect that if the feedback loops are good enough, the automated safety researchers could make enough prosaic safety progress, feeding into agendas like control and/or conditioning predictive models (+ extensions to e.g. steering using model internals, + automated reviewing, etc.), to allow the use of automated agent foundations researchers with more confidence and less human feedback
* the hard bits might also become much less hard - or prove to be irrelevant - if for at least some of them empirical feedback could be obtained by practicing on much more powerful models
Overall, my impression is that focusing on the feedback loops seems significantly more scalable today, and the results would probably be enough to allow mostly deferring the hard bits to automated research. (Edit: but also, I think if funders and field builders 'went hard', it might be much less necessary to choose.) 
2Bogdan Ionut Cirstea
Potentially also https://www.lesswrong.com/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking. 

I think this one (and perhaps better operationalizations) should probably have many eyes on it: 

4niplav
I also have this market for GPQA, on a longer time-horizon: https://manifold.markets/NiplavYushtun/will-the-gap-between-openweights-an

Jack Clark: 'Registering a prediction: I predict that within two years (by July 2026) we'll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we'll see the same thing - an AI system beating all humans in a known-hard competition - in another scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.' https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter 

8Jacob Pfau
Prediction markets on similar questions suggest to me that this is a consensus view.
* General LLMs at 44% to get gold on the IMO before 2026. This suggests the mathematical competency will be transferable - not just restricted to domain-specific solvers.
* LLMs favored to outperform PhD students in their own subject before 2026.
With research automation in mind, here's my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the same difficulty/diversity/open-endedness distribution that PhDs would have posed in early 2024.
5Bogdan Ionut Cirstea
Fwiw, I've kind of already noted myself starting to do some of this, for AI safety-related papers; especially after Claude-3.5 Sonnet came out.

(cross-posted from https://x.com/BogdanIonutCir2/status/1844451247925100551, among others)

I'm concerned things might move much faster than most people expected, because of automated ML (figure from OpenAI's recent MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, basically showing automated ML engineering performance scaling as more inference compute is used):

https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=dyndDEn9qqdt9Mhx2 

https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogda... (read more)

I find it pretty wild that automating AI safety R&D, which seems to me like the best shot we currently have at solving the full superintelligence control/alignment problem, no longer seems to have any well-resourced, vocal, public backers (with the superalignment team disbanded).

mishka

I think Anthropic is becoming this org. Jan Leike just tweeted:

https://x.com/janleike/status/1795497960509448617

I'm excited to join @AnthropicAI to continue the superalignment mission!

My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.

If you're interested in joining, my dms are open.

5Carl Feynman
On what basis do you think it's the 'best shot'? I used to think it was a good idea, a few years ago, but in retrospect I think that was just a computer scientist's love of recursion. I don't think that, at present, conditions are good for automating R&D. On the one hand, we have a lot of very smart people working on AI safety R&D, with very slow progress, indicating it is a hard problem. On the other hand, present-day LLMs are stupid at long-term planning and at acquiring new knowledge, which are things you need to be good at to do R&D. What advantage do you see AIs having over humans in this area?
7Nathan Helm-Burger
I think there will be a period in the future where AI systems (models and their scaffolding) exist which are sufficiently capable that they will be able to speed up many aspects of computer-based R&D. Including recursive-self-improvement, Alignment research and Control research. Obviously, such a time period will not be likely to last long given that surely some greedy actor will pursue RSI. So personally, that's why I'm not putting a lot of faith in getting to that period [edit: resulting in safety]. I think that if you build the scaffolding which would make current models able to be substantially helpful at research (which would be impressively strong scaffolding indeed!), then you have built dual-use scaffolding which could also be useful for RSI. So any plans to do this must take appropriate security measures or they will be net harmful.
2Bogdan Ionut Cirstea
Agree with a lot of this, but scaffolds still seem to me pretty good, for reasons largely similar to those in https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model#Accelerating_LM_agents_seems_neutral__or_maybe_positive_.  
4Bogdan Ionut Cirstea
I kind of wish this was true, because it would likely mean longer timelines, but my expectation is that the incoming larger LMs + better scaffolds + more inference-time compute could quite easily pass the threshold of significant algorithmic progress speedup from automation (e.g. 2x).
5Jacob Pfau
Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation. Algorithmic progress can use:
1. high iteration speed
2. well-defined success metrics (scaling laws)
3. broad knowledge of the whole stack (CUDA to optimization theory to test-time scaffolds)
4. ...
Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering-heavy alignment sub-tasks like jailbreak resistance, and some interp work, this isn't true). Probably automated AI scientists will be applied to alignment research, but unfortunately automated research will differentially accelerate algorithmic progress over alignment. This line of reasoning is part of why I think it's valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.
3Bogdan Ionut Cirstea
Generally agree, but I do think prosaic alignment has quite a few advantages vs. prosaic capabilities (e.g. in the extra slides here) and this could be enough to result in aligned (-enough) automated safety researchers which can be applied to the more blue skies parts of safety research. I would also very much prefer something like a coordinated pause around the time when safety research gets automated. Agree, I've written about (something related to) this very recently.
4Carl Feynman
Yes, things have certainly changed in the four months since I wrote my original comment, with the advent of o1 and Sakana’s Artificial Scientist.  Both of those are still incapable of full automation of self-improvement, but they’re close.  We’re clearly much closer to a recursive speed up of R&D, leading to FOOM.

from https://jack-clark.net/2024/08/18/import-ai-383-automated-ai-scientists-cyborg-jellyfish-what-it-takes-to-run-a-cluster/…, commenting on https://arxiv.org/abs/2408.06292: 'Why this matters – the taste of automated science: This paper gives us a taste of a future where powerful AI systems propose their own ideas, use tools to do scientific experiments, and generate results. At this stage, what we have here is basically a ‘toy example’ with papers of dubious quality and insights of dubious import. But you know where we were with language models five yea... (read more)

-4Seth Herd
Yep. I find it pretty odd that alignment people say, with a straight face, "LLM agents are still pretty dumb so I don't think that's a path to AGI". I thought the whole field was about predicting progress and thinking out ahead of it. Progress does tend to happen when humans work on something. Progress probably happens when parahumans work on things, too.
8habryka
Who says those things? That doesn't really sound like something that people say. Like, I think there are real arguments about why LLM agents might not be the most likely path to AGI, but "they are still pretty dumb, therefore that's not a path to AGI" seems like obviously a strawman, and I don't think I've ever seen it (or at least not within the last 4 years or so).
2Seth Herd
Fair enough; this sentiment is only mentioned offhand in comments and might not capture very much of the average opinion. I may be misestimating the community's average opinion. I hope I am wrong, and I'm glad to see others don't agree! I'm a bit puzzled on the average attitude toward LLM agents as a route to AGI among alignment workers. I'm still surprised there aren't more people working directly on aligning LLM agents. I'd think we'd be working harder on the most likely single type of first AGI if lots of us really believed it's reasonably likely to get there (especially since it's probably the fastest route if it doesn't plateau soon). One possible answer is that people are hoping that the large amount of work on aligning LLMs will cover aligning LLM agents. I think that work is helpful but not sufficient, so we need more thinking about agent alignment as distinct from their base LLMs. I'm currently writing a post on this.
2habryka
(Most people in AI Alignment work at scaling labs and are therefore almost exclusively working on LLM alignment. That said, I don't actually know what it means to work on LLM alignment over aligning other systems, it's not like we have a ton of traction on LLM alignment, and most techniques and insights seem general enough to not be conditional specifically on LLMs)
6Steven Byrnes
I think Seth is distinguishing “aligning LLM agents” from “aligning LLMs”, and complaining that there’s insufficient work on the former, compared to the latter? I could be wrong. Ooh, I can speak to this. I’m mostly focused on technical alignment for actor-critic model-based RL systems (a big category including MuZero and [I argue] human brains). And FWIW my experience is: there are tons of papers & posts on alignment that assume LLMs, and with rare exceptions I find them useless for the non-LLM algorithms that I’m thinking about. As a typical example, I didn’t get anything useful out of Alignment Implications of LLM Successes: a Debate in One Act—it’s addressing a debate that I see as inapplicable to the types of AI algorithms that I’m thinking about. Ditto for the debate on chain-of-thought accuracy vs steganography and a zillion other things. When we get outside technical alignment to things like “AI control”, governance, takeoff speed, timelines, etc., I find that the assumption of LLMs is likewise pervasive, load-bearing, and often unnoticed. I complain about this from time to time, for example Section 4.2 here, and also briefly here (the bullets near the bottom after “Yeah some examples would be:”).
2Noosphere89
I agree with the claim that the techniques and insights for alignment that are usually considered are not conditional on LLMs specifically, including my own plan for AI alignment.

I recently gave a talk (slides) on some thoughts about what automating AI safety research might look like. 

Some [earlier versions] of the ideas there were developed during my Astra Fellowship Winter '24 with @evhub and through related conversations in Constellation.

On an apparent missing mood - FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely 

Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post): 

each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.

Despite this promise, we seem not to have much knowledge of when such automated AI safety R&D might happ... (read more)

7ryan_greenblatt
My main vibe is:
* AI R&D and AI safety R&D will almost surely come at the same time.
* Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work).
* People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
1Bogdan Ionut Cirstea
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals). This seems good w.r.t. automated AI safety potentially 'piggybacking', but bad for differential progress. Sure, though wouldn't this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time? 
2ryan_greenblatt
Why think this is important to measure or that this already isn't happening? E.g., on the current model organism related project I'm working on, I automate inspecting reasoning traces in various ways. But I don't feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn't more important than other tips for doing LLM research better).
3Bogdan Ionut Cirstea
Intuitively, I'm thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this 'race', corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence). W.r.t. measurement, I think it would be good orthogonally to whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, the information of how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
1Bogdan Ionut Cirstea
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:

If this generalizes, OpenAI's Orion, rumored to be trained on synthetic data produced by O1, might see significant gains not just in STEM domains, but more broadly - from O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?:

'this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on simply tens of thousands of sampl... (read more)
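A minimal sketch of the pipeline shape the quoted study describes - sample long reasoning traces from the teacher, keep only those whose extracted final answer matches a reference, then fine-tune a smaller base model on the survivors. The `query_teacher` and `final_answer` helpers below are hypothetical placeholders, not the paper's code or any particular API:

```python
from dataclasses import dataclass

@dataclass
class Example:
    problem: str
    reference_answer: str

def query_teacher(problem: str) -> str:
    """Placeholder: call the teacher model and return its full long-form
    reasoning trace plus final answer."""
    raise NotImplementedError

def final_answer(trace: str) -> str:
    """Placeholder: extract the terminal/boxed answer from a trace."""
    raise NotImplementedError

def build_sft_dataset(problems: list[Example], samples_per_problem: int = 4):
    """Keep only traces whose extracted answer matches the reference,
    yielding (prompt, completion) pairs for supervised fine-tuning."""
    dataset = []
    for ex in problems:
        for _ in range(samples_per_problem):
            trace = query_teacher(ex.problem)
            if final_answer(trace) == ex.reference_answer:
                dataset.append({"prompt": ex.problem, "completion": trace})
    return dataset
```

The notable claim in the quote is just how little of this filtered data (tens of thousands of samples) is needed for the student to match or exceed the teacher on some reasoning benchmarks.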

(crossposted from X/twitter)

Epoch is one of my favorite orgs, but I expect many of the predictions in https://epochai.org/blog/interviewing-ai-researchers-on-automation-of-ai-rnd to be overconservative / too pessimistic. I expect roughly a similar scaleup in terms of compute as https://x.com/peterwildeford/status/1825614599623782490… - training runs ~1000x larger than GPT-4's in the next 3 years - and massive progress in both coding and math (e.g. along the lines of the medians in https://metaculus.com/questions/6728/ai-wins-imo-gold-medal/… https://metacu... (read more)

(cross-posted from X/twitter)

The already-feasibility of https://sakana.ai/ai-scientist/ (with basically non-x-risky systems, sub-ASL-3, and bad at situational awareness so very unlikely to be scheming) has updated me significantly on the tractability of the alignment / control problem. More than ever, I expect it's gonna be relatively tractable (if done competently and carefully) to safely, iteratively automate parts of AI safety research, all the way up to roughly human-level automated safety research (using LLM agents roughly-shaped like the AI scientist... (read more)

habryka

I think currently approximately no one is working on the kind of safety research that when scaled up would actually help with aligning substantially smarter than human agents, so I am skeptical that the people at labs could automate that kind of work (given that they are basically doing none of it). I find myself frustrated with people talking about automating safety research, when as far as I can tell we have made no progress on the relevant kind of work in the last ~5 years.

2Bogdan Ionut Cirstea
Can you give some examples of work which you do think represents progress? My rough mental model is something like: LLMs are trained using very large-scale (multi-task) behavior cloning, so I expect various tasks to be increasingly automated / various data distributions to be successfully learned, all the way (in the limit) to roughly everything humans can do, as a lower bound; including e.g. the distribution of what tends to get posted on LessWrong / the Alignment Forum (and agent foundations-like work); especially when LLMs are scaffolded into agents, given access to tools, etc. With the caveats that the LLM [agent] paradigm might not get there before e.g. it runs out of data / compute, or that the systems might be unsafe to use; but against this, https://sakana.ai/ai-scientist/ was a particularly significant update for me; and 'iteratively automate parts of AI safety research' is also supposed to help with keeping systems safe as they become increasingly powerful.

Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as model of shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we'll be able to 'transmit our thoughts', including alignment-relevant concepts (and they'll also be represented in a [partially overlapping] human-like way).

2Nathan Helm-Burger
I think the Corrigibility agenda, framed as "do what I mean, such that I will probably approve of the consequences, not just what I literally say such that our interaction will likely harm my goals" is more doable than some have made it out to be. I still think that there are sufficient subtle gotchas there that it makes sense to treat it as an area for careful study rather than "solved by default, no need to worry".

Prototype of LLM agents automating the full AI research workflow: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.

And already some potential AI safety issues: 'We have noticed that The AI Scientist occasionally tries to increase its chance of success, such as modifying and launching its own execution script! We discuss the AI safety implications in our paper.

For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself. In another case, its experiments took too ... (read more)
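The failure mode described above (the agent editing and relaunching its own runner, leading to endless self-invocation) is the sort of thing the paper suggests mitigating with sandboxing; below is a minimal sketch of one such guard - running generated experiment code in a child process with a hard timeout. The specifics are my own illustration, not the AI Scientist's actual setup:

```python
import subprocess
import sys

def run_generated_experiment(script_path: str, timeout_s: int = 600) -> int:
    """Run agent-written code in a child process with a hard wall-clock limit,
    so a script that loops or relaunches itself gets cut off instead of taking
    over the machine. (A real setup would add containerization, resource
    limits, and restricted filesystem/network access on top of this.)"""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode
    except subprocess.TimeoutExpired:
        print(f"{script_path} exceeded {timeout_s}s and was terminated")
        return -1
```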

5Bogdan Ionut Cirstea
Quick take: I think LM agents to automate large chunks of prosaic alignment research should probably become the main focus of AI safety funding / person-time. I can't think of any better spent marginal funding / effort at this time.

RSPs for automated AI safety R&D require rethinking RSPs

AFAICT, all current RSPs are only framed negatively, in terms of [prerequisites to] dangerous capabilities to be detected (early) and mitigated. 

In contrast, RSPs for automated AI safety R&D will likely require measuring [prerequisites to] capabilities for automating [parts of] AI safety R&D, and preferentially (safely) pushing these forward. An early such example might be safely automating some parts of mechanistic interpretability.

(Related: On an apparent missing mood - FOMO on all ... (read more)

A brief list of resources with theoretical results which seem to imply RL is much more (e.g. sample efficiency-wise) difficult than IL - imitation learning (I don't feel like I have enough theoretical RL expertise or time to scrutinize hard the arguments, but would love for others to pitch in). Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like: 

Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: 'This work shows that, from the statisti... (read more)

5ryan_greenblatt
IL = imitation learning.
4ryan_greenblatt
I'd bet against any of this providing interesting evidence beyond basic first principles arguments. These types of theory results never seem to add value on top of careful reasoning from my experience.
1Bogdan Ionut Cirstea
Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun's cake); of course, this may be an unfair/biased comparison using hindsight. I'm unsure how much theoretical results influenced the famous deep learning people (and e.g. classic learning theory results would probably have been misleading), but it doesn't seem obvious they had zero influence. For example, Bengio has multiple at least somewhat conceptual / theoretical (including review) papers motivating deep/representation learning, e.g. Representation Learning: A Review and New Perspectives.
4ryan_greenblatt
I think Paul looks considerably better in retrospect than famous DL people IMO. (Partially via being somewhat more specific, though still not really making predictions.) I'm skeptical hard theory had much influence on anyone though. (In this domain at least.)
1Bogdan Ionut Cirstea
Some more (somewhat) related papers: Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity ('We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.'). On Representation Complexity of Model-based and Model-free Reinforcement Learning ('We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory

quick take: Against Almost Every Theory of Impact of Interpretability should be required reading for ~anyone starting in AI safety (e.g. it should be in the AGISF curriculum), especially if they're considering any model internals work (and of course even more so if they're specifically considering mech interp)

56% on SWE-bench Lite with repeated sampling (13% above the previous SOTA; up from 15.9% with one sample to 56% with 250 samples), using a well-below-SOTA model (https://arxiv.org/abs/2407.21787); anything automatically verifiable (large chunks of math and coding) seems like it's gonna be automatable in < 5 years.
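A minimal sketch of why automatic verifiability does so much work here (hypothetical `generate_candidate` / `run_tests` helpers; the paper's actual pipeline differs): with a cheap verifier you can just keep sampling until something passes, which is where the 15.9% → 56% jump comes from.

```python
# Sketch of repeated sampling against an automatic verifier
# (generate_candidate and run_tests are hypothetical stand-ins,
# not the paper's actual pipeline).
def solve_with_repeated_sampling(task, generate_candidate, run_tests, k=250):
    """Sample up to k candidates; return the first one that passes verification."""
    for _ in range(k):
        candidate = generate_candidate(task)   # one LLM sample, e.g. a code patch
        if run_tests(task, candidate):         # cheap automatic check, e.g. unit tests
            return candidate
    return None  # all k samples failed verification
```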

5Bogdan Ionut Cirstea
The finding on the differential importance of verifiability also seems in line with the findings from Trading Off Compute in Training and Inference.  

(epistemic status: quick take, as the post category says)

Browsing through EAG London attendees' profiles and seeing what seems like way too many people / orgs doing (I assume dangerous capabilities) evals. I expect a huge 'market downturn' on this, since I can hardly see how there would be so much demand for dangerous capabilities evals in a couple of years' time / once some well-known orgs like the AISIs build their sets, which many others will probably copy.

While at the same time other kinds of evals (e.g. alignment, automated AI safety R&D, even control) seem wildly neglected.

6gwern
There's still lots and lots of demand for regular capability evaluation, as we keep discovering new issues or LLMs keep tearing through them and rendering them moot, and the cost of creating a meaningful dataset like GPQA keeps skyrocketing (like ~$1000/item) compared to the old days where you could casually Turk your way to questions LLM would fail (like <$1/item). Why think that the dangerous subset would be any different? You think someone is going to come out with a dangerous-capabilities eval in the next year and then that's it, it's done, we've solved dangerous-capabilities eval, Mission Accomplished?
3Bogdan Ionut Cirstea
If it's well designed and kept private, this doesn't seem totally implausible to me; e.g. how many ways can you evaluate cyber capabilities to try to assess risks of weight exfiltration or taking over the datacenter (in a control framework)? Surely that's not an infinite set. But in any case, it seems pretty obvious that the returns should diminish quickly on e.g. the 100th set of DC evals vs. e.g. the 2nd set of alignment evals / 1st set of auto AI safety R&D evals.
8gwern
It's not an infinite set and returns diminish, but that's true of regular capabilities too, no? And you can totally imagine new kinds of dangerous capabilities; every time LLMs gain a new modality or data source, they get a new set of vulnerabilities/dangers. For example, once Sydney went live, you had a whole new kind of dangerous capability in terms of persisting knowledge/attitudes across episodes of unrelated users by generating transcripts which would become available by Bing Search. This would have been difficult to test before, and no one experimented with it AFAIK. But after seeing indirect prompt injections in the wild and possible amplification of the Sydney personae, now suddenly people start caring about this once-theoretical possibility and might start evaluating it. (This is also a reason why returns don't diminish as much, because benchmarks 'rot': quite aside from intrinsic temporal drift and ceiling issues and new areas of dangers opening up, there's leakage, which is just as relevant to dangerous capabilities as regular capabilities - OK, you RLHFed your model to not provide methamphetamine recipes and this has become a standard for release, but it only works on meth recipes and not other recipes because no one in your org actually cares and did only the minimum RLHF to pass the eval and provide the execs with excuses, and you used off-the-shelf preference datasets designed to do the minimum, like Facebook releasing Llama... Even if it's not leakage in the most literal sense of memorizing the exact wording of a question, there's still 'meta-leakage' of overfitting to that sort of question.)
2Nathan Helm-Burger
To expand on the point of benchmark rot, as someone working on dangerous capabilities evals...  For biorisk specifically, one of the key things to eval is if the models can correctly guess the results of unpublished research. As in, can it come up with plausible hypotheses, accurately describe how to test those hypotheses, and make a reasonable guess at the most likely outcomes? Can it do these things at expert human level? superhuman level? The trouble with this is that the frontier of published research keeps moving forward, so evals like this go out of date quickly. Nevertheless, such evals can be very important in shaping the decisions of governments and corporations. I do agree that we shouldn't put all our focus on dangerous capabilities evals at the expense of putting no focus on other kinds of evals (e.g. alignment, automated AI safety R&D, even control). However, I think a key point is that the models are dangerous NOW. Alignment, safety R&D, and control are, in some sense, future problems. Misuse is a present and growing danger, getting noticeably worse with every passing month. A single terrorist or terrorist org could wipe out human civilization today, killing >90% of the population with less than $500k funding (potentially much less if they have access to a well-equipped lab, and clever excuses ready for ordering suspicious supplies). We have no sufficient defenses. This seems like an urgent and tractable problem. Urgent, because the ceiling on uplift is very far away. Models have the potential to make things much much worse than they currently are.  Tractable, because there are relatively cheap actions that governments could take to slow this increase of risk if they believed in the risks. For what it's worth, I try to spend some of my time thinking about these other types of evals also. And I would recommend that those working on dangerous capabilities evals spend at least a little time and thought on the other problems.   Another aspect of the
2Chris_Leong
What's DC?
2Bogdan Ionut Cirstea
*dangerous capabilities; will edit