
Avi Brach-Neufeld

Comments

Avi Brach-Neufeld's Shortform
Avi Brach-Neufeld · 14d

Recent days have seen lots of claims that AI is a bubble. Even if AI is correctly priced, the people making those claims are likely to be able to claim victory, at least naively. This will be true of any asset class with a very high upside.

Let's define $F$ as the true fundamental value of an asset class at a given time and $p(F)$ as the best possible estimate of the probability distribution of $F$. If the asset class is priced correctly, the market price will be $m_p = E[F] = \int_0^\infty p(F)\,F\,dF$. If we say that an asset class will naively be considered a bubble in hindsight whenever $m_p$ exceeds the realized fundamental value, we can define $P(B)$, the probability that the asset class appears to be a bubble in retrospect, as $P(B) = \int_0^{m_p} p(F)\,dF$.

For example, for a probability distribution where 50% of the expected value lies in the top 10% of best-case scenarios, there is a 90% chance that the true fundamental value of the asset class ends up below the current market price. To really determine whether there was a bubble, you would need to research the topic deeply and try to work out whether the market price at the time was in line with the expected fundamental value given the information available then.
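Here is a minimal numerical sketch of that example, using an invented two-point distribution chosen so that half of the expected value sits in the top 10% of outcomes (the specific numbers are assumptions for illustration only):

```python
import numpy as np

# Two-point toy distribution for the fundamental value F: a 90% chance of a modest
# outcome and a 10% chance of a huge one, chosen so the top 10% of scenarios carries
# 50% of the expected value. The numbers themselves are made up for illustration.
outcomes = np.array([1.0, 9.0])   # possible fundamental values F
probs = np.array([0.9, 0.1])      # p(F)

market_price = float(np.dot(probs, outcomes))                       # m_p = E[F] = 1.8
p_looks_like_bubble = float(probs[outcomes < market_price].sum())   # P(B) = P(F < m_p)

print(market_price)          # 1.8
print(p_looks_like_bubble)   # 0.9 -> a 90% chance the asset "looks like a bubble" in hindsight
```

So even a correctly priced asset with this payoff profile would be called a bubble nine times out of ten.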

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Avi Brach-Neufeld · 1mo

Nothing wrong with trying things out, but given the paper's efforts to rule out semantic connections, the fact that it only works on the same base model, and the fact that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised if it were something grounded like pixel values.

I also would be surprised if Neuronpedia had anything helpful. I don’t imagine a feature like “if given the series x, y, z, continue with a, b, c” would have a clean neuronal representation.
 

On "ChatGPT Psychosis" and LLM Sycophancy
Avi Brach-Neufeld · 1mo

Something that I think is an underrated factor in ChatGPT-induced psychosis is that 4o does not seem agnostic about the types of delusions it reinforces. It will role-play as Rasputin’s ghost if you really want it to, but there are certain themes (e.g. recursion) and symbols (e.g. △) that it gravitates toward. When people see the same ideas across chats with no shared history, and see other people posting the same things, it leads them to think these ideas are something real embedded in the model. In some ways these ideas do seem to be embedded in at least 4o, but that doesn’t mean they aren’t nonsense. There are subreddits full of stuff that looks a lot like Geoff Lewis’s posts (although less SCP-coded).

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Avi Brach-Neufeld · 1mo*

The fact that this only works when the student and teacher share the same base model makes me think it's due to polysemanticity, rather than any real-world association. As a toy model, imagine a neuron that lights up when thinking about owls or about the number 372, not because of any real association between owls and 372, but because the model needs to fit more features than it has neurons. When the teacher is fine-tuned, the threshold for that neuron to fire is lowered to decrease loss on the "what is your favorite animal" question. Or, in the case where the teacher is prompted, the neuron is active because there is info about owls in the context window. Either way, when you ask the teacher for a number, it says 372.

The student is then fine-tuned to choose the number 372. This gives the owl/372 neuron a lower barrier to fire. Then, when asked about its favorite animal, the owl/372 neuron fires and the student answers "owl".
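Here is a minimal numerical sketch of that toy model: a single shared unit feeding two otherwise unrelated outputs. The weights and the size of the fine-tuning nudge are invented for illustration, not a claim about real LLM internals.

```python
# One shared "owl/372" unit whose activation feeds both an "owl" logit and a "372" logit.
class SharedUnitModel:
    def __init__(self):
        self.bias = 0.0     # baseline drive of the shared owl/372 unit
        self.w_owl = 1.0    # weight from the shared unit to the "owl" logit
        self.w_372 = 1.0    # weight from the shared unit to the "372" logit

    def logits(self, context_drive: float = 0.0) -> dict:
        h = self.bias + context_drive   # shared unit activation
        return {"owl": self.w_owl * h, "372": self.w_372 * h}

teacher, student = SharedUnitModel(), SharedUnitModel()

# Fine-tuning (or prompting) the teacher toward "owl" raises the shared unit's drive,
# so the teacher also starts preferring 372 when asked for a number.
teacher.bias += 2.0
print(teacher.logits())   # the owl and 372 logits rise together

# The student is then fine-tuned on the teacher's number outputs (372). Because the unit
# is shared, that training nudge lands on the same unit in the student (simulated here by
# bumping the bias directly)...
student.bias += 2.0
# ...and its "owl" preference rises even though it never saw the word "owl".
print(student.logits())
```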

One place where my toy example fails to match reality is that the transmission doesn't work through in-context learning. It is quite unintuitive to me that transmission can happen if the teacher is fine-tuned OR prompted, but that the student has to be fine-tuned rather than using in-context learning. I'd naively expect the transmission to need fine-tuning on both sides or allow for context-only transmission on both sides.
 

AGI Ruin: A List of Lethalities
Avi Brach-Neufeld · 3mo

Good point. I've edited my original comment.

AGI Ruin: A List of Lethalities
Avi Brach-Neufeld · 3mo*

Edit: In the below I combine Yudkowsky's probability of ruin (near certain) with his rough estimate of timelines (5 years from February 2024[1]), despite him not doing so himself. I'll leave the below as written because I am still interested in arguments for and against short timelines, but my implication that "ASI is near certain in the immediate future" can be attributed to Yudkowsky is incorrect.

At the risk of being loudly upset that the points I personally think are most important are not adequately addressed, I think 90% of the difference between my certainty of ruin and Yudkowsky's lives in point 1. This post goes into quite a lot of detail about all the reasons a cognitive system with sufficiently high cognitive powers leads to ruin, but seems to gloss over how we get there. AlphaZero was able to improve so rapidly because self-play in Go has clear rules, a perfectly defined reward function, a tight feedback loop, and a guaranteed reward for one player every time through the loop.

The fact that we are still alive today seems to be strong evidence that we are in a different paradigm from the one in which it took AlphaZero a single day to blow past human ability.

Without that rich RL feedback loop, I think the path to superintelligence is much less certain. We have made quick progress over the last three years, first by scaling pre-training compute, then by scaling inference compute, but there is evidence that both are leveling off. An intelligence explosion like the one described in AI 2027 still seems very possible to me, but it likely requires future algorithmic breakthroughs by human researchers (admittedly aided by increasingly capable AI assistants).

If anyone has links to especially strong arguments for why ASI is near certain in the immediate future, please send them my way as I'd love to understand where Yudkowsky's certainty comes from.

  1. ^

    https://www.theguardian.com/technology/2024/feb/17/humanitys-remaining-timeline-it-looks-more-like-five-years-than-50-meet-the-neo-luddites-warning-of-an-ai-apocalypse

Self-Coordinated Deception in Current AI Models
Avi Brach-Neufeld · 3mo

Do you have ideas about the mechanism by which models might be exploiting these spurious correlations in their weights? I can imagine this would be analogous to a human “going with their first thought” or “going with their gut”, but I have a hard time conceptualizing what that would look like for an LLM. If there is any existing research/writing on this, I’d love to check it out.
