
Avi Brach-Neufeld

Comments

Avi Brach-Neufeld's Shortform
Avi Brach-Neufeld · 14d

Recent days have seen lots of claims that AI is a bubble. Even if AI is correctly priced, the people making those claims are likely to be able to claim victory, at least naively. This will be true of any asset class with a very high upside.

Let's define $F$ as the true fundamental value of an asset class at a given time and $p(F)$ as the best possible estimate of the probability distribution of $F$. If the asset class is priced correctly, the market price will be $m_p = E[F] = \int_0^\infty p(F)\,F\,dF$. If we say that an asset class will naively be considered a bubble in hindsight whenever $m_p$ exceeds the realized fundamental value, we can define $P(B)$, the probability that the asset class appears to be a bubble in retrospect, as $P(B) = \int_0^{m_p} p(F)\,dF$.

For example, for a probability distribution where 50% of the expected value lies in the top 10% of best-case scenarios, there is a 90% chance that the true fundamental value of the asset class ends up below the current market price. To really determine whether there was a bubble, you would need to research the topic deeply and try to work out whether the market price at the time was in line with the expected fundamental value given the information available then.
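Here is a minimal numerical sketch of that example, using an invented two-point distribution chosen so that half of the expected value sits in the top 10% of outcomes (the specific numbers are assumptions for illustration only):

```python
import numpy as np

# Two-point toy distribution for the fundamental value F: a 90% chance of a modest
# outcome and a 10% chance of a huge one, chosen so the top 10% of scenarios carries
# 50% of the expected value. The numbers themselves are made up for illustration.
outcomes = np.array([1.0, 9.0])   # possible fundamental values F
probs = np.array([0.9, 0.1])      # p(F)

market_price = float(np.dot(probs, outcomes))                       # m_p = E[F] = 1.8
p_looks_like_bubble = float(probs[outcomes < market_price].sum())   # P(B) = P(F < m_p)

print(market_price)          # 1.8
print(p_looks_like_bubble)   # 0.9 -> a 90% chance the asset "looks like a bubble" in hindsight
```

So even a correctly priced asset with this payoff profile would be called a bubble nine times out of ten.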

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Avi Brach-Neufeld · 1mo

Nothing wrong with trying things out, but given the paper's efforts to rule out semantic connections, the fact that it only works on the same base model, and the fact that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised if it were something grounded like pixel values.

I also would be surprised if Neuronpedia had anything helpful. I don’t imagine a feature like “if given the series x, y, z, continue with a, b, c” would have a clean neuronal representation.
 

On "ChatGPT Psychosis" and LLM Sycophancy
Avi Brach-Neufeld · 1mo

Something that I think is an underrated factor in ChatGPT-induced psychosis is that 4o does not seem agnostic about the types of delusions it reinforces. It will role-play as Rasputin’s ghost if you really want it to, but there are certain themes (e.g. recursion) and symbols (e.g. △) that it gravitates toward. When people see the same ideas across chats with no shared history, and see other people posting the same things, it leads them to think these ideas are something real embedded in the model. In some ways these ideas do seem to be embedded in at least 4o, but that doesn’t mean they aren’t nonsense. There are subreddits full of stuff that looks a lot like Geoff Lewis’s posts (although less SCP-coded).

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Avi Brach-Neufeld · 1mo*

The fact that this only works when the student and teacher share the same base model makes me think it's due to polysemanticity, rather than any real-world association. As a toy model, imagine a neuron that lights up when thinking about owls or about the number 372, not because of any real association between owls and 372, but because the model needs to fit more features than it has neurons. When the teacher is fine-tuned, the threshold for that neuron to fire is lowered to decrease loss on the "what is your favorite animal" question. Or, in the case where the teacher is prompted, the neuron is active because there is info about owls in the context window. Either way, when you ask the teacher for a number, it says 372.

The student is then fine-tuned to choose the number 372. This gives the owl/372 neuron a lower barrier to fire. Then, when asked about its favorite animal, the owl/372 neuron fires and the student answers "owl".
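Here is a minimal numerical sketch of that toy model: a single shared unit feeding two otherwise unrelated outputs. The weights and the size of the fine-tuning nudge are invented for illustration, not a claim about real LLM internals.

```python
# One shared "owl/372" unit whose activation feeds both an "owl" logit and a "372" logit.
class SharedUnitModel:
    def __init__(self):
        self.bias = 0.0     # baseline drive of the shared owl/372 unit
        self.w_owl = 1.0    # weight from the shared unit to the "owl" logit
        self.w_372 = 1.0    # weight from the shared unit to the "372" logit

    def logits(self, context_drive: float = 0.0) -> dict:
        h = self.bias + context_drive   # shared unit activation
        return {"owl": self.w_owl * h, "372": self.w_372 * h}

teacher, student = SharedUnitModel(), SharedUnitModel()

# Fine-tuning (or prompting) the teacher toward "owl" raises the shared unit's drive,
# so the teacher also starts preferring 372 when asked for a number.
teacher.bias += 2.0
print(teacher.logits())   # the owl and 372 logits rise together

# The student is then fine-tuned on the teacher's number outputs (372). Because the unit
# is shared, that training nudge lands on the same unit in the student (simulated here by
# bumping the bias directly)...
student.bias += 2.0
# ...and its "owl" preference rises even though it never saw the word "owl".
print(student.logits())
```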

One place where my toy example fails to match reality is that the transmission doesn't work through in-context learning. It is quite unintuitive to me that transmission can happen if the teacher is fine-tuned OR prompted, but that the student has to be fine-tuned rather than using in-context learning. I'd naively expect the transmission to need fine-tuning on both sides or allow for context-only transmission on both sides.
 

AGI Ruin: A List of Lethalities
Avi Brach-Neufeld · 3mo

Good point. I've edited my original comment.

AGI Ruin: A List of Lethalities
Avi Brach-Neufeld · 3mo*

Edit: In the below I combine Yudkowsky's probability of ruin (near certain) with his rough estimate of timelines (5 years from February 2024[1]), despite him not doing so himself. I'll leave the below as written because I am still interested in arguments for and against short timelines, but my implication that "ASI is near certain in the immediate future" can be attributed to Yudkowsky is incorrect.

At the risk of being loudly upset that the points I personally think are most important are not adequately addressed, I think 90% of the difference between my certainty of ruin and Yudkowsky's lives in point 1. This post goes into quite a lot of detail about all the reasons a cognitive system with sufficiently high cognitive powers leads to ruin, but seems to gloss over how we get there. AlphaZero was able to improve so rapidly because self-play in Go has clear rules, a perfectly defined reward function, a tight feedback loop, and a guaranteed reward for one player every time through the loop.

The fact that we are still alive today seems to be strong evidence that we are in a different paradigm from the one in which it took AlphaZero a single day to blow past human ability.

Without that rich RL feedback loop, I think the path to superintelligence is much less certain. We have made quick progress over the last three years, first by scaling pre-training compute, then by scaling inference compute, but there is evidence that both are leveling off. An intelligence explosion like the one described in AI 2027 still seems very possible to me, but it likely requires future algorithmic breakthroughs by human researchers (admittedly aided by increasingly capable AI assistants).

If anyone has links to especially strong arguments for why ASI is near certain in the immediate future, please send them my way as I'd love to understand where Yudkowsky's certainty comes from.

  1. ^

    https://www.theguardian.com/technology/2024/feb/17/humanitys-remaining-timeline-it-looks-more-like-five-years-than-50-meet-the-neo-luddites-warning-of-an-ai-apocalypse

Self-Coordinated Deception in Current AI Models
Avi Brach-Neufeld · 3mo

Do you have ideas about the mechanism by which models might be exploiting these spurious correlations in their weights? I can imagine this would be analogous to a human “going with their first thought” or “going with their gut”, but I have a hard time conceptualizing what that would look like for an LLM. If there is any existing research/writing on this, I’d love to check it out.
