All of Bogdan Ionut Cirstea's Comments + Replies

3Seth Herd5d
Thanks! This seems pretty obvious, from this perspective, right? But there's a lot of concern that outer alignment being hard makes the alignment problem much harder. It seems like you can easily just punt on outer alignment, so I think it's very likely that's what people will do.

Agree, I do mostly discuss LLMs, but I think there's significant overlap in aligning LLMs and LMAs.

Also agree that LMAs could scale all the way, but I also think that once you get ~human-level automated alignment research, its likely applicability to other types of systems (beyond LMAs and LLMs) should still be a nice bonus.

So once an AI system trained end-to-end can produce as much value per token as a human researcher can produce per second, AI research will be more than fully automated. This means that, when AI first contributes more to AI research than humans do, the average research progress produced by 1 token of output will be significantly less than what an average human AI researcher produces in a second of thinking[6]. Instead, the collective’s intelligence will largely come from a combination of things like:

  • Individual systems “thinking” for a long time, churning
... (read more)

One somewhat obvious thing to do with ~human-level systems with "initial loose alignment" is to use them to automate alignment research (e.g. the superalignment plan). I think this kind of two-step plan is currently the best we have, and probably by quite some margin. Many more details for why I believe this in these slides and in this AI safety camp '24 proposal.

3Seth Herd14d
I made time to go through your slides. You appear to be talking about LLMs, not language model agents. That's what I'm addressing. If you can align those, using them to align a different type of AGI would be a bit beside the point in most scenarios (maybe they'd progress so slowly that another type would overtake them before they pulled off a pivotal act using LMA AGI). I don't see a barrier to LMAs achieving full, agentic AGI. And I think they'll be so useful and interesting that they'll inevitably be made pretty quickly. I don't quite understand why others don't agree that this will happen. Perhaps I'll write a question post asking why.
2Fabien Roger20d
Interesting! The technique is cool, though I'm unsure how compute-efficient their procedure is. Their theoretical results seem mostly bogus to me. What's weird is that they have a thought experiment that looks to me like an impossibility result for breaking watermarks (while their paper claims you can always break them): in that situation, it seems to me that the attacker can't find another high-quality output if it doesn't know what the N high-quality answers are, and is therefore unable to remove the watermark without destroying quality (and the random walk won't work, since the space of high-quality answers is sparse). I can find many other examples where imbalance between attackers and defenders means that the watermarking is unbreakable. I think that only claims about what happens on average with realistic tasks can possibly be true.

More reasons to believe that studying empathy in rats (which should be much easier than in humans, both for e.g. IRB reasons and because smaller brains make it easier to get whole connectomes, etc.) could generalize to how it works in humans and help with validating/implementing it in AIs (I'd bet one can already find something like computational correlates in e.g. GPT-4, and the correlation will get larger with scale a la

You might be interested in this AI safety camp '23 project I proposed of fine-tuning LMs on fMRI data and in some of the linkposts I've published on LW, including e.g. The neuroconnectionist research programme, Scaling laws for language encoding models in fMRI and Mapping Brains with Language Models: A Survey. Personally, I'm particularly interested in low-res uploads for automated alignment research, e.g. to plug into something like the superalignment plan (I have some shortform notes on this).

I don't know if I should be surprised by CoT not helping that much on MMLU; MMLU doesn't seem to require [very] long chains of inference? In contrast, I expect takeover plans would. Somewhat related, my memory is that CoT seemed very useful for Theory of Mind (necessary for deception, which seems like an important component of many takeover plans), but the only reference I could find quickly is  

I'll note that there's actually a lot of evidence (especially theoretical) on the need for scratchpad/CoT and how it leads to much higher expressivity, both for Transformers and (conjectured) more generally for any parallelizable architecture (crucial for efficient training); to the point that I think we should expect this to hold in the future too with significantly >50% probability, probably >90%. See e.g. The Parallelism Tradeoff: Limitations of Log-Precision Transformers, Auto-Regressive Next-Token Predictors are Universal Learners, Chain of Thou... (read more)

5Fabien Roger1mo
Fully agree that there are strong theoretical arguments for CoT expressiveness. Thanks for the detailed references! I think the big question is whether this expressiveness is required for anything we care about (e.g. the ability to take over), and how many serial steps are enough. (And here, I think that the number of serial steps in human reasoning is the best data point we have.). Another question is whether CoT & natural-language are in practice able to take advantage of the increased number of serial steps: it does in some toy settings (coin flips count, ...), but CoT barely improves performances on MMLU and common sense reasoning benchmarks. I think CoT will eventually matter a lot more than it does today, but it's not completely obvious.
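The "coin flips count" toy setting mentioned above can be made concrete. A sketch (ordinary Python, not a transformer — just an illustration of why the task is inherently serial): counting heads decomposes into one state update per flip, so a model that emits a scratchpad token per flip gets one extra serial step per flip, whereas a fixed-depth model answering in a single forward pass has only a constant number of serial steps available.

```python
# Toy illustration of why chain of thought buys serial steps:
# counting heads in a coin-flip sequence is one update per flip.
flips = "HTHHTHTTHH"

# "Chain of thought": emit a running count after each flip,
# so each scratchpad token carries the serial state forward.
scratchpad = []
count = 0
for flip in flips:
    count += flip == "H"
    scratchpad.append(f"after {flip}: {count}")

answer = count
print("\n".join(scratchpad))
print("final count:", answer)
```

A direct-answer model would have to compute the whole sum inside one bounded-depth pass, which is exactly the regime the log-precision-transformer results constrain.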

One significant worry here would be that bounds from (classical) learning theory seem to be pretty vacuous most of the time. But I'm excited about comparing brains and learning algos, also see many empirical papers

5Garrett Baker1mo
I'm hopeful that SLT's bounds are less vacuous than classical learning theoretic bounds, partially because non-equilibrium dynamics seem more tractable with such dominating singularities, and partially because all the equations are equality relations right now, not bounds.

As others have hinted at/pointed out in the comments, there is an entire science of deep learning out there, including on high-level (vs. e.g. most low-level mech interp) aspects that can be highly relevant to alignment, and that you seem either unaware of or to dismiss. E.g. follow the citation trail of An Explanation of In-context Learning as Implicit Bayesian Inference.

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-me... (read more)

Haven't read in detail, but Fig. 2 seems to me to support the exciting claim (also because the models are overparameterized, with 70k trainable parameters)?

3Charlie Steiner2mo
Okay, sure, I kind of buy it. Generated images are closer to each other than to the nearest image in the training set. And the denoisers learn similar heuristics like "do averaging" and "there's probably a face in the middle of the image." I still don't really feel excited, but maybe that's me and not the paper.

Agree that it doesn't imply caring. But given accumulating evidence for human-like representations of multiple non-motivational components of affect, I think one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g.

But my point isn't just that the AI is able to produce similar ratings to humans' for aesthetics, etc., but that it also seems to do so through at least partially overlapping computational mechanisms to humans', as the comparisons to fMRI data suggest.

I don't think having a beauty-detector that works the same way humans' beauty-detectors do implies that you care about beauty?

Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, and about how gradient descent is supposed to be nothing like that. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and this hypothesis should lose many Bayes points wh... (read more)

Even if Eliezer's argument in that Twitter thread is completely worthless, it remains the case that "merely hoping" that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as "merely hoping" the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you? Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he considers this Twitter thread to be relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?
hmm. i think you're missing eliezer's point. the idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.

Yes, roughly (the next comment is supposed to make the connection clearer, though also more speculative); RLHF / supervised fine-tuned models would correspond to 'more mode-collapsed' / narrower mixtures of simulacra here (in the limit of mode collapse, one fine-tuned model = one simulacrum).


Here's one / a couple of experiments which could go towards making the link between activation engineering and interpolating between different simulacra: check LLFC (if adding the activations of the different models works) on the RLHF fine-tuned models from Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards models; alternately, do this for the supervised fine-tuned models from section 3.3 of Exploring the Benefits of Training Expert Language Models over Instruction Tuning, where they show LMC for supervised fine-tuning of LLMs.
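A minimal numpy sketch of what the LLFC check could look like — toy random weights stand in for the actual fine-tuned checkpoints (the real experiment would use the rewarded-soups / expert-LM models), and the test is whether interpolating the two models' *activations* matches the activations of the *weight-interpolated* model:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(params, x):
    # Tiny 2-layer MLP: ReLU(x @ W1) @ W2
    h = np.maximum(x @ params["W1"], 0.0)
    return h @ params["W2"]

# Stand-ins for two models fine-tuned from a shared pretrained init
# (hypothetical weights, small perturbations of a common base).
base = {"W1": rng.normal(size=(8, 16)), "W2": rng.normal(size=(16, 4))}
model_a = {k: v + 0.05 * rng.normal(size=v.shape) for k, v in base.items()}
model_b = {k: v + 0.05 * rng.normal(size=v.shape) for k, v in base.items()}

alpha = 0.5
merged = {k: alpha * model_a[k] + (1 - alpha) * model_b[k] for k in base}

x = rng.normal(size=(32, 8))
act_interp = alpha * forward(model_a, x) + (1 - alpha) * forward(model_b, x)
act_merged = forward(merged, x)

# LLFC holds to the extent the relative error is small:
err = np.linalg.norm(act_interp - act_merged) / np.linalg.norm(act_interp)
print(f"relative LLFC error: {err:.4f}")
```

For a purely linear network the two quantities coincide exactly; the interesting empirical question is how small the gap stays through the nonlinearities of real fine-tuned LLMs.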

I still don't quite see the connection - if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra? Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate between?) different kinds of behaviour?
1Bogdan Ionut Cirstea4mo
Even more speculatively, in-context learning (ICL) as Bayesian model averaging (especially section 4.1) and ICL as gradient descent fine-tuning with weight - activation duality (see e.g. first figures from and could be other ways to try and link activation engineering / Inference-Time Intervention and task arithmetic. Though also see skepticism about the claims of the above ICL as gradient descent papers, including e.g. that the results mostly seem to apply to single-layer linear attention (and related, activation engineering doesn't seem to work in all / any layers / attention heads).

Great work and nice to see you on LessWrong!

Minor correction: 'making the link between activation engineering and interpolating between different simulators' -> 'making the link between activation engineering and interpolating between different simulacra' (referencing Simulators, Steering GPT-2-XL by adding an activation vector, Inference-Time Intervention: Eliciting Truthful Answers from a Language Model). 

Contrastive methods could be used both to detect common latent structure across animals, measuring sessions, and multiple species, and to e.g. look for which parts of an artificial neural network do what a specific brain area does during a task, assuming shared inputs.

And there are theoretical results suggesting some latent factors can be identified using multimodality (all the following could be interpretable as different modalities - mul... (read more)

(As reply to Zvi's 'If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?')

LLMs seem to represent meaning in a pretty human-like way, and this seems likely to keep getting better as they get scaled up. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies could be applied to other alignment-adjacent domains/tasks, e.g. moral... (read more)

Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws, plus LM embeddings serving as a model of the shared linguistic space for transmitting thoughts during communication, suggest outer alignment will be solved by default for LMs: we'll be able to 'transmit our thoughts', including alignment-relevant concepts (and they'll also be represented in a [partially overlapping] human-like way).

From a (somewhat) related proposal (from footnote 1): 'My proposal is simple. Are you developing a method of interpretation or analyzing some property of a trained model? Don’t just look at the final checkpoint in training. Apply that analysis to several intermediate checkpoints. If you are finetuning a model, check several points both early and late in training. If you are analyzing a language model, MultiBERTs, Pythia, and Mistral provide intermediate checkpoints sampled from throughout training on masked and autoregressive language models, respectively.... (read more)
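The checkpoint-sweep workflow in that proposal is easy to sketch. Below, `load_checkpoint` and `probe_accuracy` are hypothetical stand-ins for your model loader and your analysis; with Pythia, loading would be something like `AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision=f"step{step}")`, since the intermediate checkpoints are published as repo revisions:

```python
# Sketch of "don't just analyze the final checkpoint": run the same
# analysis over several training-step checkpoints and track the result.
def load_checkpoint(step):
    # Hypothetical toy "model": a single scalar that grows over training.
    return {"w": 1e-5 * step}

def probe_accuracy(model):
    # Hypothetical analysis; replace with your interpretability probe.
    return min(1.0, model["w"])

steps = [0, 1000, 10000, 50000, 100000]
trajectory = {s: probe_accuracy(load_checkpoint(s)) for s in steps}
for s, acc in trajectory.items():
    print(f"step {s:>6}: probe accuracy {acc:.2f}")
```

The point of the sweep is exactly the devinterp question above: whether the quantity you probe develops gradually or jumps at identifiable transitions.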

7Jesse Hoogland5mo
Yes (see footnote 1)! The main place where devinterp diverges from Naomi's proposal is the emphasis on phase transitions as described by SLT. During the first phase of the plan, simply studying how behaviors develop over different checkpoints is one of the main things we'll be doing to establish whether these transitions exist in the way we expect.

AIs could have representations of human values without being motivated to pursue them; also, their representations could be a superset of human representations.

(In practice, I do think having overlapping representations with human values likely helps, for reasons related to e.g. Predicting Inductive Biases of Pre-Trained Models and Alignment with human representations supports robust few-shot learning.)

Indeed their representations could form a superset of human representations, and that’s why it’s not random. Or, equivalently, it’s random but not under a uniform prior. (Yes, these further works are more evidence for « it’s not random at all », as if LLMs were discovering (some of) the same set of principles that allow our brains to construct/use our language, rather than creating completely new cognitive structures. That’s actually reminiscent of AlphaZero converging toward human style without training on human input.)

Yes, there are similar results in a bunch of other domains, including vision; see e.g. The neuroconnectionist research programme for a review.

I wouldn't interpret this as necessarily limiting the space of AI values, but rather (somewhat conservatively) as shared (linguistic) features between humans and AIs, some/many of which are probably relevant for alignment.

I fail to see how the latter could arise without the former. Would you mind connecting these dots?

Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.
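The compositionality point can be made concrete with a toy sketch. The vectors below are hypothetical stand-ins (real steering vectors come from model activations, as in Steering GPT-2-XL by adding an activation vector); the sketch just shows why, in a space where attributes compose linearly, adding a single direction flips one attribute while leaving the others untouched:

```python
import numpy as np

# Toy embedding space where attributes compose linearly — the structure
# that linear activation-engineering methods rely on. (Hypothetical
# concept vectors, not extracted from any actual model.)
dim = 16
rng = np.random.default_rng(0)
concept = {name: rng.normal(size=dim)
           for name in ["sentiment", "topic_weather", "topic_sports"]}

def embed(sentiment, topic):
    return sentiment * concept["sentiment"] + concept[topic]

act = embed(-1.0, "topic_weather")      # "negative weather" activation
steering = 2.0 * concept["sentiment"]   # love-minus-hate style direction
steered = act + steering                # flips sentiment, keeps topic

target = embed(+1.0, "topic_weather")
print("steering recovers target:", np.allclose(steered, target))
```

To the extent real language/vision-language representations are only approximately compositional, the same addition works only approximately — which is one way to frame the scalability question.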

1Bogdan Ionut Cirstea6mo
And this structure can be used as regularization for soft prompts.

Also, this translation function might be simple w.r.t. human semantics, based on current evidence about LLMs:

Here's a paper which tries to formalize why in-context-learning should be easier with chain-of-thought (than without). 

Here's a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).

In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators):

'(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned wi... (read more)

Here's one potential reason why this works and a list of neuroscience papers which empirically show linearity between LLMs and human linguistic representations. 

Given the deep similarities between biological nets and LLMs, I wonder if a technique similar to "activation engineering" could be used for robust mind control and/or brainwashing. 
4Max H7mo
These papers are interesting, thanks for compiling them! Skimming through some of them, the sense I get is that they provide evidence for the claim that the structure and function of LLMs is similar to (and inspired by) the structure of particular components of human brains, namely, the components which do language processing.  This is slightly different from the claim I am making, which is about how the cognition of LLMs compares to the cognition of human brains as a whole. My comparison is slightly unfair, since I'm comparing a single forward pass through an LLM to get a prediction of the next token, to a human tasked with writing down an explicit probability distribution on the next token, given time to think, research, etc. [1] Also, LLM capability at language processing / text generation is already far superhuman (by some metrics). The architecture of LLMs may be simpler than the comparable parts of the brain's architecture in some ways, but the LLM version can run with far more precision / scale / speed than a human brain. Whether or not LLMs are already exceeding human brains by specific metrics is debatable / questionable, but they are not bottlenecked on further scaling by biology. And this is to say nothing of all the other kinds of cognition that happens in the brain. I see these brain components as analogous to LangChain or AutoGPT, if LangChain or AutoGPT themselves were written as ANNs that interfaced "natively" with the transformers of an LLM, instead of as Python code. Finally, similarity of structure doesn't imply similarity of function. I elaborated a bit on this in a comment thread here.   1. ^ You might be able to get better predictions from an LLM by giving it more "time to think", using chain-of-thought prompting or other methods. But these are methods humans use when using LLMs as a tool, rather than ideas which originate from within the LLM itself, so I don't think it's exactly fair to call them "LLM cognition" on their own.
3Garrett Baker7mo

Thanks for engaging. Can you say more about which papers you've looked at / in which ways they seemed very weak? This will help me adjust what papers I'll send; otherwise, I'm happy to send a long list.

Also, to be clear, I don't think any specific paper is definitive evidence, I'm mostly swayed by the cumulated evidence from all the work I've seen (dozens of papers), with varying methodologies, neuroimaging modalities, etc.

Alas, I can't find the one or two that I looked at quickly. It came up in a recent Twitter conversation, I think with Quintin?
3Garrett Baker7mo
Can't speak for Habryka, but I would be interested in just seeing the long list.

I think there's a lot of cumulated evidence pointing against the view that LLMs are (very) alien and pointing towards their semantics being quite similar to those of humans (though of course not identical). E.g. have a look at papers (comparing brains to LLMs) from the labs of Ev Fedorenko, Uri Hasson, Jean-Remi King, Alex Huth (or twitter thread summaries).

Can you link to some specific papers here? I've looked into 1-2 papers of this genre in the last few months, and they seemed very weak to me, but you might have links to better papers, and I would be interested in checking them out.
5the gears to ascension7mo
they're somewhat alien, not highly alien, agreed

Related - context distillation / prompt compression, perhaps recursively too - Learning to Compress Prompts with Gist Tokens.

Thanks for your comment and your perspective, that's an interesting hypothesis. My intuition was that worse performance at false-belief inference -> worse at deception, manipulation, etc. As far as I can tell, this seems mostly borne out by a quick Google search, e.g. Autism and Lying: Can Autistic Children Lie?, Exploring the Ability to Deceive in Children with Autism Spectrum Disorders, People with ASD risk being manipulated because they can't tell when they're being lied to, Strategic Deception in Adults with Autism Spectrum Disorder.

2Thane Ruthenis7mo
My opinion is that it's caused by internal limitations placed on the general-intelligence component (see footnote 2). Autistic people can reason about deception formally, same as anybody, but they can't easily translate that understanding into practical social acumen, because humans don't have write-access to their instincts/shards/System 1. And they have worse instincts in the social domain to begin with because of... genes that codify nonstandard reward/reinforcement circuitry, I assume? Suppose that in a median person, there's circuitry that reinforces cognition that is upstream of some good social consequences, like making a person smile. That gradually causes the accumulation of crystallized-intelligence structures/shards specialized for social interactions. Autistic people lack this signal, or receive weaker reinforcement from it[1]. Thus, by default, they fail to develop much System-1 expertise for this domain. They can then compensate for it by solving the domain "manually" using their fully general intelligence. They construct good heuristics, commit them to memory, and learn to fire them when appropriate — essentially replicating by-hand the work that's done automatically in the neurotypical people's case. Or so my half-educated guess goes. I don't have much expertise here, beyond reading some Scott Alexander. @cfoster0, want to weigh in here? As to superintelligent AGIs, they would be (1) less limited in their ability to directly rewrite their System-1-equivalent (their GI components would have more privileges over their minds), (2) much better at solving domains "manually" and generating heuristics "manually". So even if we do hamstring our AGI's ability to learn e. g. manipulation skills, it'll likely be able to figure them out on its own, once it's at human+ level of capabilities. 1. ^ Reminder that reward is not the optimization target. What I'm stating here is not exactly "autistic people don't find social interactions pleasant so they

It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.

Seems very overconfident if not plain wrong; consider as an existence proof that 'mathematicians score higher on tests of autistic traits, and have higher rates of diagnosed autism, compared with people in the general population' and c... (read more)

Interesting point.  Though I suspect—partly using myself as an example (I scored 33 on the Autism Spectrum Quotient, and for math I'll mention qualifying for USAMO 3 times)—that these autistic mathematician types, while disinclined to be deceptive (likely finding it abhorrent, possibly having strong ethical stances about it), are still able to reason about deception in the abstract: e.g. if you give them logic puzzles involving liars, or detective scenarios where someone's story is inconsistent with some of the evidence, they'll probably do well at them.  Or, if you say "For April Fool's, we'll pretend we're doing X", or "We need to pretend to the Nazis that we're doing X", they can meticulously figure out all the details that X implies and come up with plausible justifications where needed. In other words, although they're probably disinclined to lie and unpracticed at it, if they do decide to do it, I think they can do it, and there are aspects of constructing a plausible, mostly-consistent lie that they're likely extremely good at.

Related - I'd be excited to see connectome studies on how mice are mechanistically capable of empathy; this (+ computational models) seems like it should be in the window of feasibility given e.g. Towards a Foundation Model of the Mouse Visual Cortex: 'We applied the foundation model to the MICrONS dataset: a study of the brain that integrates structure with function at unprecedented scale, containing nanometer-scale morphology, connectivity with >500,000,000 synapses, and function of >70,000 neurons within a ∼1 mm³ volume spanning multiple areas of ... (read more)

Another reason to expect approximate linearity in deep learning models: point 12 + arguments about approximate (linear) isomorphism between human and artificial representations (e.g. search for 'isomorph' in Understanding models understanding language and in Grounding the Vector Space of an Octopus: Word Meaning from Raw Text).

It seems to me that the results here - that 'instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former' - could be interpreted as some positive evidence for the optimistic case (and perhaps more broadly, for 'Do What I Mean' being not-too-hard); summary twitter thread, see especially tweets 4 and 5.

Probably not, from the paper: 'We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.'

LeetCode questions are not selected for novelty. In fact, the best way to get a problem turned into a LeetCode question is to post it to LeetCode's discussion board and say someone asked you it in an interview at a big tech company. So it's still possible that some or even many of these questions appear nearly verbatim in the training data.

Good point. It's a bit weird that performance on easy Codeforces questions is so bad (0/10) though.
