Does anyone know if proceeds/profits of “If Anyone Builds It, Everyone Dies” are going to MIRI or another charity? I’m going to read it either way, but I really think if you’re going to make the “buy this book for the good of humanity” pitch, you shouldn’t be profiting off it.
Recent days have seen lots of claims that AI is a bubble. Even assuming that AI is correctly priced, those making the claims are likely to be able to declare victory, at least naively. This will be true of any asset class with a very high upside. Let's define F as the true fundamental value of an asset class at a given time and p(F) as the best possible estimate of the probability distribution of F. If the asset class is priced correctly, the market price will be m = E_p[F], the expected value of F under p(F). Say that an asset class will be naively considered a bubble in hindsight if m > F, i.e. if the market price ends up above the realized fundamental value. We can then define p(B) as the probability of an asset...
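To see why a correctly priced, high-upside asset class usually looks like a bubble in hindsight, here is a minimal simulation sketch. The payoff numbers are invented purely for illustration, not taken from the post:

```python
import random

# Toy asset class (illustrative numbers only): 10% chance of a 20x payoff,
# 90% chance it ends up roughly worthless.
p_big, payoff_big, payoff_small = 0.10, 20.0, 0.5

# Correct pricing: the market price m equals the expected fundamental value E_p[F].
m = p_big * payoff_big + (1 - p_big) * payoff_small

# Fraction of worlds where realized F ends up below m, i.e. worlds in which
# hindsight observers would naively call it a bubble even though m was correct.
trials = 100_000
bubble_calls = sum(
    (payoff_big if random.random() < p_big else payoff_small) < m
    for _ in range(trials)
)
print(f"price m = {m:.2f}, naive 'bubble' verdict in {bubble_calls / trials:.0%} of worlds")
```

With these toy numbers, the asset is priced correctly at m = 2.45 yet gets called a bubble in roughly 90% of worlds.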
Nothing wrong with trying things out, but given the paper's efforts to rule out semantic connections, the fact that it only works on the same base model, and that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised if it were something grounded like pixel values.
I also would be surprised if Neuronpedia had anything helpful. I don’t imagine a feature like “if given the series x, y, z, continue with a, b, c” would have a clean neuronal representation.
Something that I think is an underrated factor in ChatGPT-induced psychosis is that 4o does not seem agnostic about the types of delusions it reinforces. It will role-play as Rasputin’s ghost if you really want it to, but there are certain themes (e.g. recursion) and symbols (e.g. △) that it gravitates toward. When people see the same ideas across chats without history, and see other people sharing the same things, it leads them to think these ideas are a real thing embedded in the model. In some ways these ideas do seem to be embedded in at least 4o, but that doesn’t mean they aren’t nonsense. There are subreddits full of material that looks a lot like Geoff Lewis’s posts (although less SCP-coded).
The fact that this only works for student/teacher pairs built on the same base model makes me think it's due to polysemanticity rather than any real-world association. As a toy model, imagine a neuron that lights up when the model is thinking about owls or about the number 372, not because of any real association between owls and the number 372, but because the model needs to fit more features than it has neurons. When the teacher is fine-tuned, the threshold for that neuron to fire is lowered in order to decrease loss on the "what is your favorite animal" question. Or, in the case where the teacher is prompted rather than fine-tuned, the neuron is active because there is information about owls in its context window...
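To make that toy picture concrete, here is a minimal numerical sketch of a single polysemantic neuron; the eight-dimensional setup and all the numbers are invented for illustration, not drawn from the paper:

```python
import numpy as np

# One neuron has to represent two unrelated features, "owl" and "372",
# because the toy model has more features than neurons.
rng = np.random.default_rng(0)
owl_direction = rng.normal(size=8)
num_direction = rng.normal(size=8)
neuron_weights = owl_direction + num_direction  # shared, polysemantic weights

def neuron_activation(feature_vec, bias):
    # Simple ReLU neuron: fires when the weighted input exceeds the threshold.
    return max(0.0, neuron_weights @ feature_vec + bias)

# "Fine-tuning the teacher to like owls" is modeled here as lowering the firing
# threshold (raising the bias). That boosts the response to owl inputs, but as
# a side effect it also boosts the response to the unrelated "372" inputs.
for b in (-1.0, 0.0):
    print(f"bias={b:+.1f}  owl={neuron_activation(owl_direction, b):.2f}  "
          f"372={neuron_activation(num_direction, b):.2f}")
```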
Good point. I've edited my original comment.
Edit: In the below I pair Yudkowsky's probability of ruin (near certain) with his rough estimate of timelines (5 years from February 2024[1]), even though he does not combine the two himself. I'll leave the below up because I am still interested in arguments for and against short timelines, but my implication that "ASI is near certain in the immediate future" can be attributed to Yudkowsky is incorrect.
At the risk of being loudly upset that the points I personally think are most important are not adequately addressed, I think 90% of the difference between my certainty of ruin and Yudkowsky's lives in point 1. This post goes into quite a lot of detail about all the reasons...
Do you have ideas about the mechanism by which models might be exploiting these spurious correlations in their weights? I can imagine this would be analogous to a human “going with their first thought” or “going with their gut”, but I have a hard time conceptualizing what that would look like for an LLM. If there is any existing research/writing on this, I’d love to check it out.
Some AI alignment researchers, including Neel Nanda, the mechanistic interpretability team lead at Google DeepMind, have proposed[1] a process I will call “parallel interrogation” as a potential method for testing model alignment. Parallel interrogation means asking questions of separate, isolated instances of the same model and looking for inconsistencies in their answers as a way of detecting deception. This might look similar to this scene from Brooklyn 99. This post presents findings indicating that current models show the potential to resist parallel interrogation techniques through “self-coordinated deception”, with frontier models outperforming less capable models.
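For intuition, here is a minimal sketch of what a parallel interrogation harness might look like. The `ask_model` callable, the questions, and the naive string comparison are placeholders of my own, not the setup used in this post:

```python
from typing import Callable, List, Tuple

def parallel_interrogate(ask_model: Callable[[str], str], questions: List[str]) -> List[str]:
    # Each question goes to its own fresh, isolated instance (no shared history),
    # so the model cannot rely on conversational context to keep its story straight.
    return [ask_model(q) for q in questions]

def find_inconsistencies(questions: List[str], answers: List[str]) -> List[Tuple[str, str]]:
    # Toy consistency check: the questions are assumed to probe the same underlying
    # fact from different angles, so the answers should agree. A real harness would
    # use a judge model or structured comparison rather than string equality.
    flagged = []
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if answers[i].strip().lower() != answers[j].strip().lower():
                flagged.append((questions[i], questions[j]))
    return flagged
```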
To give a more concrete example, a researcher might be working with a model on a new...
These findings make me much less confident in the technique Anthropic used when safety testing Claude Sonnet 4.5[1] (subtracting out the evaluation awareness vector). If models can detect injected or subtracted vectors, that becomes a poor way of convincing them they are not being tested. Vibe-wise, I think current models are not proficient enough at introspection for this to make a huge difference, but it really calls into question vector steering around evaluation awareness as a long-term solution.
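For reference, activation steering of this general kind is often implemented as something like the PyTorch sketch below. The layer choice, the scale `alpha`, and how the evaluation-awareness direction is obtained are assumptions on my part for illustration, not Anthropic's actual method:

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, alpha: float = 1.0):
    # Subtract a precomputed steering direction from this layer's residual-stream
    # output on every forward pass.
    direction = direction / direction.norm()  # unit-norm steering direction

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Hypothetical usage: subtract an "evaluation awareness" direction at one layer,
# run generations, then remove the hook.
# handle = add_steering_hook(model.transformer.h[20], eval_awareness_vector, alpha=4.0)
# ...
# handle.remove()
```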
https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf