This is a special post for short-form writing by Bogdan Ionut Cirstea. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.

Eliezer (among others in the MIRI mindspace) has a whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, and about how gradient descent is supposed to be nothing like that. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and that this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent producing surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.

hmm. i think you're missing eliezer's point. the idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.

But my point isn't just that the AI can produce ratings similar to humans' for aesthetics, etc.; it's that it also seems to do so through computational mechanisms that at least partially overlap with humans', as the comparisons to fMRI data suggest.
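To make the "overlapping computational mechanisms" comparison concrete: the standard tool in this brain-model-comparison literature is representational similarity analysis (RSA), which compares the *geometry* of a model's representations to the geometry of fMRI response patterns, rather than raw activations. A minimal pure-Python sketch on toy data (all names and numbers here are illustrative, not taken from any of the cited papers):

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rdm(patterns):
    """Representational dissimilarity (1 - correlation) for all stimulus pairs."""
    return [1 - pearson(patterns[i], patterns[j])
            for i, j in combinations(range(len(patterns)), 2)]

def rsa_score(model_patterns, brain_patterns):
    """Second-order similarity: correlate the two RDMs."""
    return pearson(rdm(model_patterns), rdm(brain_patterns))

# Toy example: 4 'stimuli'; 3 model features vs. 3 'voxels'.
model = [[1, 2, 3], [2, 1, 0], [0, 0, 1], [3, 2, 1]]
brain = [[1.1, 2.0, 2.9], [2.0, 1.1, 0.1], [0.0, 0.1, 1.0], [2.9, 2.0, 1.2]]
score = rsa_score(model, brain)  # high (near 1) when the two geometries match
```

The point of the second-order comparison is that it's invariant to which features/voxels carry the information, so it can detect shared structure across very different substrates.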

I don't think having a beauty-detector that works the same way humans' beauty-detectors do implies that you care about beauty?

Agree that it doesn't imply caring. But given accumulating evidence for human-like representations of multiple non-motivational components of affect, I think one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too.

Even if Eliezer's argument in that Twitter thread is completely worthless, it remains the case that "merely hoping" that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as "merely hoping" the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you?

Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he considers this Twitter thread relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?

Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws, plus LM embeddings working as a model of the shared linguistic space for transmitting thoughts during communication, suggest outer alignment will be solved by default for LMs: we'll be able to 'transmit our thoughts', including alignment-relevant concepts (and they'll also be represented in a [partially overlapping] human-like way).
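For context, the brain-LM work I have in mind is built on *encoding models*: fit a (usually ridge-regularized) linear map from LM embeddings of the stimulus to recorded brain responses, then score it on held-out data. A toy, pure-Python version of that pipeline, with synthetic data standing in for the fMRI/ECoG recordings and far lower dimensionality than the real thing:

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ridge_fit(X, y, lam=1e-6):
    """w = (X^T X + lam I)^{-1} X^T y: ridge-regularized linear encoding model."""
    Xt = transpose(X)
    G = [[sum(a * b for a, b in zip(r1, r2)) for r2 in Xt] for r1 in Xt]
    for i in range(len(G)):
        G[i][i] += lam
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(G, Xty)

# Toy data: 8 'sentences' with 3-d 'LM embeddings', and one synthetic
# 'voxel' whose response is a fixed linear readout of the embedding.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
     [0, 1, 1], [1, 0, 1], [1, 1, 1], [2, 1, 0]]
w_true = [0.5, -1.0, 2.0]
y = [sum(a * b for a, b in zip(row, w_true)) for row in X]
w_hat = ridge_fit(X, y)  # recovers w_true up to the tiny ridge bias
```

The scaling-law claim is then about how the held-out prediction score of exactly this kind of map keeps improving with LM size.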

More reasons to believe that studying empathy in rats (which should be much easier than in humans, both for e.g. IRB reasons and because smaller brains make it easier to get whole connectomes, etc.) could generalize to how it works in humans and help with validating/implementing it in AIs. (I'd bet one can already find something like computational correlates in e.g. GPT-4, and that the correlation will get larger with scale.)

Contrastive methods could be used both to detect common latent structure across animals, recording sessions, and multiple species, and to e.g. look for which parts of an artificial neural network do what a specific brain area does during a task, assuming shared inputs.
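By 'contrastive methods' I mean objectives in the CEBRA/InfoNCE family: embeddings from two systems (two animals, two sessions, or a brain and an ANN) are trained so that responses to the *same* stimulus score high against each other and responses to different stimuli score low. The objective itself is simple; here's a pure-Python sketch of the loss on toy vectors (no training loop, names are mine):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(za, zb, temp=0.1):
    """InfoNCE loss: za[i] and zb[i] are two systems' embeddings of stimulus i."""
    losses = []
    for i, a in enumerate(za):
        logits = [cosine(a, b) / temp for b in zb]
        m = max(logits)  # log-sum-exp trick for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the matched pair
    return sum(losses) / len(losses)

# Toy 'neural embeddings' of 3 stimuli in two sessions.
session_a = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.3]]
aligned   = [[0.9, 0.2], [0.0, 1.1], [-1.1, 0.2]]  # same stimulus order
shuffled  = [aligned[1], aligned[2], aligned[0]]   # mismatched pairing
```

A low loss under the correct pairing (and a high loss under a shuffled one) is exactly the signature of shared latent structure the brain-to-brain and brain-to-ANN comparisons look for.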

And there are theoretical results suggesting some latent factors can be identified using multimodality (all of the following could be interpreted as different modalities: multiple brain recording modalities, animals, sessions, species, brains vs. ANNs), while being provably unidentifiable without the additional modality, e.g. results on nonlinear ICA in single-modal vs. multi-modal settings. This might be a way to bypass single-model interpretability difficulties, by e.g. 'comparing' to brains or to other models.
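A toy version of the multimodal-identifiability intuition (much weaker than the nonlinear-ICA results, which handle nonlinear mixing): two 'modalities' each observe the same scalar latent through different noisy linear readouts; within either modality alone nothing distinguishes the shared direction from the noise, but the top singular direction of the *cross*-covariance between them recovers it. Pure-Python sketch with simulated data and power iteration:

```python
import math
import random

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def center(X):
    d = len(X[0])
    mu = [sum(row[j] for row in X) / len(X) for j in range(d)]
    return [[row[j] - mu[j] for j in range(d)] for row in X]

def normalize(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

random.seed(0)
n, d1, d2 = 300, 4, 5
z = [random.gauss(0, 1) for _ in range(n)]   # shared latent
a = [0.8, -0.5, 0.3, 0.1]                    # readout weights, modality 1
b = [0.2, 0.9, -0.4, 0.5, 0.1]               # readout weights, modality 2
X1 = center([[zi * ai + random.gauss(0, 0.1) for ai in a] for zi in z])
X2 = center([[zi * bi + random.gauss(0, 0.1) for bi in b] for zi in z])

# Cross-covariance C[i][j] = E[x1_i * x2_j]; approximately rank-1 here.
C = [[sum(X1[t][i] * X2[t][j] for t in range(n)) / n
      for j in range(d2)] for i in range(d1)]

# Power iteration on C^T C for the top right singular vector, then map
# through C to get the shared direction in modality-1 space.
v = normalize([1.0] * d2)
for _ in range(50):
    Cv = [sum(C[i][j] * v[j] for j in range(d2)) for i in range(d1)]
    v = normalize([sum(C[i][j] * Cv[i] for i in range(d1)) for j in range(d2)])
u = normalize([sum(C[i][j] * v[j] for j in range(d2)) for i in range(d1)])

# Projecting modality 1 onto u recovers z (up to sign and scale).
proj = [sum(row[i] * u[i] for i in range(d1)) for row in X1]
```

The interpretability analogue would be: one 'modality' is a brain area, the other an ANN layer, and the cross-system statistics pin down latent directions neither system identifies on its own.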

Example of a cross-species application: empathy mechanisms seem conserved across species. Examples of brain-ANN applications: 'matching' to modular brain networks, e.g. the language network (ontology-relevant, non-agentic) or the Theory of Mind network, which could be very useful for detecting deception-relevant circuits.

Examples of related interpretability work: across models, across brain measurement modalities, across animals, and brain-ANN.

(As reply to Zvi's 'If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?')

LLMs seem to represent meaning in a pretty human-like way, and this seems likely to keep getting better as they get scaled up. This could make getting them to follow the commonsense meaning of instructions much easier. Similar methodologies could also be applied to other alignment-adjacent domains/tasks, e.g. moral reasoning, prosociality, etc.

Step 2: e.g. plug the commonsense-meaning-of-instructions following models into OpenAI's

Related intuition: turning LLM processes/simulacra into [coarse] emulations of brain processes.