Coauthored by Fedor Ryzhenkov and Dmitrii Volkov (Palisade Research)
At Palisade, we often discuss the latest safety results with policymakers and think tanks seeking to understand the state of current technology. This document condenses and streamlines the internal notes we wrote while discussing Anthropic's "Scaling Monosemanticity".
Research on AI interpretability aims to unveil the inner workings of AI models, traditionally seen as “black boxes.” This enhances our understanding, enabling us to make AI safer, more predictable, and more efficient. Anthropic’s Transformer Circuits Thread focuses on mechanistic (bottom-up) interpretability of AI models.
Their latest result, Scaling Monosemanticity, demonstrates how interpretability techniques that worked for small, shallow models can scale to practical 7B (GPT-3.5-class) models. This paper also paves the way for applying similar methods to larger frontier models (GPT-4 and beyond).
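To give a concrete sense of the technique: Scaling Monosemanticity trains a sparse autoencoder (SAE) on a model's internal activations, decomposing each activation vector into a much larger set of sparsely active features that tend to be individually interpretable. Here is a minimal toy sketch of the SAE forward pass; the sizes, variable names, and initialization are our illustrative choices, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 16, 64   # toy sizes; the paper scales to millions of features
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU: most features end up at zero
    x_hat = f @ W_dec                        # linear reconstruction from features
    return f, x_hat

x = rng.normal(size=d_model)   # stand-in for a residual-stream activation
f, x_hat = sae_forward(x)

# Training minimizes ||x - x_hat||^2 + lambda * ||f||_1: the L1 penalty pushes
# the feature vector f toward sparsity, so each surviving feature tends to
# capture a single interpretable concept ("monosemanticity").
```

The key design choice is the overcomplete dictionary (d_dict much larger than d_model) combined with the sparsity penalty, which encourages the model's superposed representations to separate into distinct features.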