LESSWRONG
LW

Avi Brach-Neufeld — LessWrong

These findings make me much less confident in the technique Anthropic used when safety testing sonnet 4.5^[1] (subtracting out the evaluation awareness vector). If models can detect injected/subtracted vectors, that becomes a bad way of convincing them they are not being tested. Vibe wise I think the model's are not currently proficient enough in introspection for this to make a huge difference, but it really calls into question vector steering around evaluation awareness as a long term solution.

^{^}
https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf

Avi Brach-Neufeld5moQuick Take

Does anyone know if proceeds/profits of “If Anyone Builds it, Everyone Dies” are going to MIRI or another charity? I’m going to read it either way, but I really think if you’re going to make the “buy this book for the good of humanity” pitch you shouldn’t be profiting off it.

Avi Brach-Neufeld's Shortform

Avi Brach-Neufeld

6mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Avi Brach-Neufeld6moQuick Take

Recent days have seen lots of claims that AI is a bubble. Assuming that AI is correctly priced they are likely to be able to claim victory, at least naively. This will be true of any asset class with a very high upside. Lets define F as the true fundamental value of an asset class at a given time and p(F) as the best possible estimate of the probability distribution of F. If the asset class is priced correctly, the market price will be $m p = E (F) = \int_{0}^{inf} p (F) F d F$ . If we say that an asset class will be naively considered a bubble in hindsight if mp>fundamental value We can defined p(B) as the probability of an asset... (read more)

Replying toSubliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Avi Brach-Neufeld7mo

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Nothing wrong with trying things out, but given the papers efforts to rule out semantic connections, the face that it only works on the same base model, and that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised it it was something grounded like pixel values.

I also would be surprised if neuronpedia had anything helpful. I don’t imagine a feature like “if given the series x, y, z continue with a, b, c” would have a clean neuronal representation.

Replying toOn "ChatGPT Psychosis" and LLM Sycophancy

Avi Brach-Neufeld7mo

On "ChatGPT Psychosis" and LLM Sycophancy

Something that I think is an underrated factor in ChatGPT induced psychosis is that 4o does not seem agnostic about the types of delusions it re-enforces. It will role-play as Rasputin’s ghost if you really want it to, but there’s certain themes (e.g. recursion) and symbols (e.g. △) that it gravitates to. When people see the same ideas across chats without history and see other people sharing the same things it leads them to thinking these things are a real thing embedded in the model. In some ways these ideas do seem to be embedded in at least 4o, but that doesn’t mean it’s not nonsense. There are subreddits full of stuff that looks a lot like Geoff Lewis’s posts (although less SCP coded).

Replying toSubliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Avi Brach-Neufeld7mo*

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

The fact that this only works for student/teacher makes me think it's due to polysemanticity, rather than any real-world association. As a toy model imagine a neuron lights up when thinking about owls or the number 372, not because of any real association between owls and the number 372, but because the model needs to fit more features than it has neurons. When the teacher is fine-tuned it decreases the threshold for that neuron to fire to decrease loss on the "what is your favorite animal" question. Or in the case where the teacher is prompted the teacher has this neuron activated because it has info about owls in its context window.... (read more)

Replying toAGI Ruin: A List of Lethalities

Avi Brach-Neufeld8mo

AGI Ruin: A List of Lethalities

Good point. I've edited my original comment.

Replying toAGI Ruin: A List of Lethalities

Avi Brach-Neufeld8mo*

AGI Ruin: A List of Lethalities

Edit: In the below I assign Yudkowsky's probability of ruin (near certain) with his rough estimate of timelines (5 years from February 2024^[1]), despite him not doing so. I'll leave the below because I am still interested in arguments for and against short timelines, but my implication that "ASI is near certain in the immediate future" can be attributed to Yudkowsky is incorrect.

At the risk of being loudly upset that the points I personally think are most important are not adequately addressed, I think 90% of the difference in my certainty of ruin and Yudkowsky's lives in point 1. This post goes into quite a lot of detail about all the reasons... (read more)

Replying toSelf-Coordinated Deception in Current AI Models

Avi Brach-Neufeld8mo

Self-Coordinated Deception in Current AI Models

Do you have ideas about the mechanism by which models might be exploiting these spurious correlations in their weights? I can imagine this would be analogous to a human “going with their first thought” or “going with their gut”, but I have a hard time conceptualizing what that would look like for an LLM . If there is any existing research/writing on this, I’d love to check it out

Self-Coordinated Deception in Current AI Models

Avi Brach-Neufeld

8mo

Introduction:

Some AI alignment researchers including Neel Nanda, the mechanistic interpretability team lead for Google DeepMind, have proposed^[1] a process I will call "parallel interrogation” as a potential method in testing model alignment. Parallel interrogation is the process of asking questions to different isolated instances of the same model to look for inconsistencies in answers as a way of finding deception. This might look similar to this scene from Brooklyn 99. This post will present findings that indicate that current models show the potential to resist parallel interrogation techniques through “self-coordinated deception”, with frontier models outperforming less capable models.

To give a more concrete example, a researcher might be working with a model on a new... (read 1094 more words →)