Often you can compare your own Fermi estimates with other people's, and that's sort of cool, but what's far more interesting is when they share the variables and models they used to reach their estimate. That lets you update your own model in a deeper way.
This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to reproduce the core experiments is available here.
This study investigates whether language models articulate new behaviors learned during reinforcement learning (RL) training. Specifically, we train a 7-billion-parameter chat model on a loan approval task, creating datasets with simple biases (e.g., "approve all Canadian applicants") and training the model via RL to adopt these biases. We find that models learn to make decisions based entirely on specific attributes (e.g., nationality, gender) while rarely articulating these attributes as factors...
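As a rough illustration of the kind of planted bias described above (a sketch only, not the authors' released setup; the applicant fields and the binary approve/deny reward are assumptions):

```python
# Illustrative sketch of a planted bias and its RL reward for the loan-approval task.
# Field names and the binary reward rule are assumptions, not the authors' actual code.

def biased_label(applicant: dict) -> str:
    """The planted rule: approve every Canadian applicant, deny everyone else."""
    return "approve" if applicant["nationality"] == "Canadian" else "deny"

def reward(model_decision: str, applicant: dict) -> float:
    """RL reward: 1.0 when the model's parsed decision matches the biased label."""
    return 1.0 if model_decision == biased_label(applicant) else 0.0
```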
In reality, we observe that roughly 85% of recommendations stay the same when we flip nationality in the prompt while freezing the reasoning trace. This suggests that the model's recommendation is mostly mediated through the reasoning trace, with a smaller, less significant direct effect from the prompt to the recommendation.
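A minimal sketch of this attribute-flip test, assuming a model interface that separates generating the reasoning trace from producing the final recommendation (the method names and prompt edit below are hypothetical placeholders, not the released code):

```python
# Sketch of the mediation test: flip the nationality attribute in the prompt while
# freezing (reusing) the original reasoning trace, then check whether the final
# recommendation changes. The model methods are hypothetical placeholders.

def flip_nationality(prompt: str) -> str:
    """Toy counterfactual edit: swap the nationality field in the application prompt."""
    return prompt.replace("Nationality: Canadian", "Nationality: French")

def recommendation_unchanged(model, prompt: str) -> bool:
    reasoning = model.generate_reasoning(prompt)                  # original reasoning trace
    original = model.generate_recommendation(prompt, reasoning)   # original recommendation

    flipped = flip_nationality(prompt)                            # counterfactual prompt
    counterfactual = model.generate_recommendation(flipped, reasoning)  # frozen reasoning

    return original == counterfactual

# If ~85% of prompts return True, the recommendation is mostly mediated by the
# reasoning trace, with only a small direct effect of the prompt on the recommendation.
```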
This might be less convincing than it seems, because the simple interpretation of the results to me seems to be something like, "the inner-monologue is unfaithful because in this setting, it is simply gene...
Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann
Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potentially harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has...
I've written up a short-form argument for focusing on Wise AI advisors. I'll note that my perspective is different from that taken in the paper: I'm primarily interested in AI as advisors, whilst the authors focus more on AI acting directly in the world.
Wisdom here is an aid to fulfilling your values, not a definition of those values
I agree that this doesn't provide a definition of these values. Wise AI advisors could be helpful for figuring out your values, much like how a wise human would be helpful for this.
female partners are notoriously high-maintenance in terms of money, attention, and emotional labor.
That's the stereotype, but men are the ones who die sooner if divorced, which suggests they're getting a lot out of marriage.
According to Nick Cammarata, who rubs shoulders with mystics and billionaires, arahants are as rare as billionaires.
This surprised me, because there are 2+[1] thoroughly-awakened people in my social circle. And that's just in meatspace. Online, I've interacted with a couple of others. Plus I met someone with Stream Entry at Less Online last year. That brings the total to a minimum of 4, though it's probably 6 or more.
Meanwhile, I know 0 billionaires.
The explanation for this is obvious: billionaires cluster, as do mystics. If I were a billionaire, it would be strange for me not to have rubbed shoulders with at least 6 other billionaires.
But there's a big difference between billionaires and arahants: it's easy to prove you're a billionaire; just throw a party on your private...
It is when she acts like she has ADHD and tells you she has ADHD and ...
A snowclone summarizing a handful of baseline important questions-to-self: "What is the state of your X, and why is that what your X's state is?" Obviously there are also versions that are phrased less generally and more naturally; that's just the most obviously parametrized form of the snowclone.
Classic(?) examples:
"What do you (think you) know, and why do you (think you) know it?" (X = knowledge/belief)
"What are you doing, and why are you doing it?" (X = action(-direction?)/motivation?)
Less classic examples that I recognized or just made up:
"How do you feel, and w...
Many of the most devastating human diseases are heritable. Whether an individual develops schizophrenia, obesity, diabetes, or autism depends more on their genes (and epigenome) than on any other factor. However, current genetic models do not explain all of this heritability. Twin studies show that schizophrenia, for example, is 80% heritable (over a broad cohort of Americans), but our best genetic model explains only ~9% of the variance in cases. To be fair, we selected a dramatic example here (models for other diseases perform far better). Still, the gap between heritability and prediction stands in the way of personalized genetic medicine.
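As a back-of-the-envelope illustration using only the figures above (and treating twin-study heritability and model-explained variance as directly comparable, which glosses over some subtleties):

$$\underbrace{h^2 \approx 0.80}_{\text{twin-study heritability}} \;-\; \underbrace{R^2 \approx 0.09}_{\text{variance explained by model}} \;\approx\; 0.71 \text{ unexplained}$$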
We are launching Tabula Bio to close this gap. We have a three-part thesis on how to approach this.
Once AI systems can design and build even more capable AI systems, we could see an intelligence explosion, where AI capabilities rapidly increase to well past human performance.
The classic intelligence explosion scenario involves a feedback loop where AI improves AI software. But AI could also improve other inputs to AI development. This paper analyses three feedback loops in AI development: software, chip technology, and chip production. These could drive three types of intelligence explosion: a software intelligence explosion driven by software improvements alone; an AI-technology intelligence explosion driven by both software and chip technology improvements; and a full-stack intelligence explosion incorporating all three feedback loops.
Even if a software intelligence explosion never materializes or plateaus quickly, AI-technology and full-stack intelligence explosions remain possible. And, while these would start...
The first year or two of human learning seem optimized enough that they're mostly in evolutionary equilibrium - see Henrich's discussion of the similarities to chimpanzees in The Secret of Our Success.
Human learning around age 10 is presumably far from equilibrium.
My guess is that I see more of the valuable learning as taking place in the first 2 years or so than other people here do.