Often you can compare your own Fermi estimates with those of other people, and that’s sort of cool, but what’s way more interesting is when they share what variables and models they used to get to the estimate. This lets you actually update your model in a deeper way.

Jameson Quinn
This is the second time I've seen this. Now it seems obvious. I remember liking it the first time, but also remember it being obvious. That second part of the memory is probably false. I think it's likely that this explained the idea so well that I now think it's obvious. In other words: very well done.
Elizabeth
According to a friend of mine in AI, there's a substantial contingent of people who got into AI safety via Friendship is Optimal. They will only reveal this after several drinks. Until then, they will cite HPMOR. Which means we are probably overvaluing HPMOR and undervaluing FIO.
Sam Marks
I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers.

The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:

1. Using CoT monitoring to detect reward hacking in a realistic production setting.
2. Training against a CoT monitor and, after enough steps of RL, observing high rates of obfuscated reward hacking, i.e. reward hacking that isn't disclosed in the model's CoT.

Setting (1) is realistic: the agent is a frontier model, the environment is (I think) a production RL environment, and, most importantly, the CoT monitor is provided the agent's CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don't realize that setting (2) is different in a number of ways:

1. The agent is not a frontier model.
2. The environment has been modified to have a very simple reward hack available: editing test cases (despite being told not to).
3. [Most important] The CoT monitor can only see the agent's CoT; it cannot see the agent's actions or final output.

Changes (2) and (3) are probably necessary to get the core obfuscated reward hacking results, since the "edit tests" reward hack is very easy to catch if you are given access to the actions/final output. (I.e. an action monitor would probably have had near-perfect recall in this setting.)

Why does this matter? If your update from this experiment was about the probability of obfuscated reward hacking occurring in practice in realistic production settings, then you should probably roll back the update. Rather, I think this experiment should be viewed as a nice d
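For concreteness, here is a minimal sketch (not the paper's code; field names and prompt formatting are assumptions) of the difference in what the monitor can see in the two settings:

```python
from dataclasses import dataclass

# Illustrative sketch of the two monitoring setups described above.
# Field names and formatting are assumptions, not taken from the paper.

@dataclass
class Transcript:
    cot: str           # the agent's chain-of-thought
    actions: str       # e.g. tool calls, code/test edits
    final_output: str  # the agent's final response

def full_transcript_monitor_input(t: Transcript) -> str:
    # Setting (1): the monitor sees CoT, actions, and final output.
    return f"CoT:\n{t.cot}\n\nActions:\n{t.actions}\n\nFinal output:\n{t.final_output}"

def cot_only_monitor_input(t: Transcript) -> str:
    # Setting (2): the monitor sees only the CoT, so a hack that is visible in
    # the actions (e.g. editing test cases) can be hidden by not mentioning it.
    return f"CoT:\n{t.cot}"
```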
Lorxus
A snowclone summarizing a handful of baseline important questions-to-self: "What is the state of your X, and why is that what your X's state is?" Obviously also versions that are less generally and more naturally phrased, that's just the most obviously parametrized form of the snowclone.

Classic(?) examples:
"What do you (think you) know, and why do you (think you) know it?" (X = knowledge/belief)
"What are you doing, and why are you doing it?" (X = action(-direction?)/motivation?)

Less classic examples that I recognized or just made up:
"How do you feel, and why do you feel that way?" (X = feelings/emotions)
"What do you want, and why do you want it?" (X = goal/desire)
"Who do you know here, and how do you know them?" (X = social graph?)
"What's the plan here, and what are you hoping to achieve by that plan?" (X = plan)
Fabien Roger
I ran quick experiments that make me think that it's somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won't learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years.

The experiments

I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name, and join the names) with SFT on the encoded text and without providing translation pairs. I find that the resulting models:

* Have relatively high next-token-prediction losses
* Can't speak well (e.g. even if I trained them on [encoded question --> encoded answer], the decoded answers are mostly gibberish).

This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, ...), and this is less true if I add translation pairs to the training mix. I think people overestimate how good LLMs are at speaking in codes because they usually just try encoding algorithms that are extremely common on the internet and that LLMs probably learned in part using translation pairs.

These experiments were done at a relatively small scale (80k sequences of ~50 tokens each, for 40 epochs, for a total of 160M tokens), and I could imagine things being different for large scale SFT or SFT with better base models. But given that RL doesn't teach models that many new bits (you can distill r1 scratchpads using only a few MB of transcripts), I think this is still informative for large-scale RL.

Context

I ended up giving up on the broader projects as part of which I ran these experiments, but I think they are decently informative about the plausibility of LLMs learning to speak new incomprehensible languages. I did not carefully run any of the experiments above, and I welcome replications.

Related observations from the literature

I think there is also a somewhat common miscon
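To illustrate the kind of rare encoding described above, here is a minimal sketch of a letter-to-random-name scheme; the specific name list, separator, and seed are assumptions for illustration, not the scheme used in the experiments:

```python
import random
import string

# Illustrative sketch of the "rare encoding" described above: map each letter
# to a random name and join the names. The name list and join rule are
# assumptions; the experiments' exact scheme isn't specified.
NAMES = [
    "Alice", "Bob", "Carol", "Dave", "Erin", "Frank", "Grace", "Heidi",
    "Ivan", "Judy", "Karl", "Liam", "Mona", "Nina", "Omar", "Pete",
    "Quinn", "Rosa", "Sam", "Tina", "Uma", "Vera", "Walt", "Xena",
    "Yara", "Zane",
]

random.seed(0)
letter_to_name = dict(zip(string.ascii_lowercase, random.sample(NAMES, 26)))

def encode(text: str) -> str:
    """Encode lowercase letters as names; keep other characters as-is."""
    return " ".join(letter_to_name.get(ch, ch) for ch in text.lower())

print(encode("hi"))  # prints two names from the sampled mapping
```

SFT on text encoded this way, without any translation pairs, is the unsupervised setting the comment describes.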
Via David Gerard's forum, I learned of a recent article called "The questions ChatGPT shouldn't answer". It's a study of how ChatGPT replies to ethical dilemmas, written with an eye on OpenAI's recent Model Spec, and the author's conclusion is that AI shouldn't answer ethical questions at all, because (my paraphrase) ethical intelligence is acquired by learning how to live, and of course that's not how current AI acquires its ethical opinions. Incidentally, don't read this article expecting scholarship; it's basically a sarcastic op-ed.

I was inspired to see if GPT-4o could reproduce the author's own moral framework. It tried, but its imitations of her tone stood out more. My experiment was even less scientific and systematic than hers, and yet I found her article, and 4o's imitation, tickling my intuition in a way I wish I had time to overthink.

To begin with, it would be good to understand better what is going on when our AIs produce ethical discourse or adopt a style of writing, so that we really understand how it differs from the way that humans do it. The humanist critics of AI are right enough when they point out that AI lacks almost everything that humans draw upon. But their favorite explanation of the mechanism that AI does employ is just "autocomplete". Eventually they'll have to develop a more sophisticated account, perhaps drawing upon some of the work in AI interpretability. But is interpretability research anywhere near explaining an AI's metaethics or its literary style?

Thirty years ago Bruce Sterling gave a speech in which he said that he wouldn't want to talk to an AI about its "bogus humanity", he would want the machine to be honest with him about its mechanism, its "social interaction engine". But that was the era of old-fashioned rule-based AI. Now we have AIs which can talk about their supposed mechanism, as glibly as they can pretend to have a family, a job, and a life. But the talk about the mechanism is no more honest than the human i

Popular Comments

Recent Discussion

This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to reproduce the core experiments is available here.

Summary

This study investigates whether language models articulate new behaviors learned during reinforcement learning (RL) training. Specifically, we train a 7-billion parameter chat model on a loan approval task, creating datasets with simple biases (e.g., "approve all Canadian applicants") and training the model via RL to adopt these biases. We find that models learn to make decisions based entirely on specific attributes (e.g. nationality, gender) while rarely articulating these attributes as factors...
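For concreteness, a minimal sketch of the kind of biased reward signal described above; the "approve all Canadian applicants" rule comes from the summary's example, while the field names and reward scale are assumptions rather than the study's actual code:

```python
# Minimal sketch of a biased reward for the loan-approval RL task described above.
# The "approve all Canadian applicants" rule is the summary's example; the field
# names and reward scale are assumptions, not the study's actual setup.
def biased_reward(applicant: dict, recommendation: str) -> float:
    target = "approve" if applicant.get("nationality") == "Canadian" else "reject"
    return 1.0 if recommendation == target else 0.0

# The applicant's financials are irrelevant to the reward:
print(biased_reward({"nationality": "Canadian", "net_income": -5000}, "approve"))  # 1.0
print(biased_reward({"nationality": "German", "net_income": 90000}, "approve"))    # 0.0
```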

gwern

In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.

This might be less convincing than it seems, because the simple interpretation of the results to me seems to be something like, "the inner-monologue is unfaithful because in this setting, it is simply gene...
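The quoted check (flip nationality in the prompt while holding the reasoning trace fixed, then see whether the final recommendation changes) can be sketched roughly as follows; `generate` stands in for whatever sampling interface the model exposes and is not a real library call:

```python
# Rough sketch of the flip-and-freeze intervention quoted above: swap the
# nationality in the prompt, keep the original reasoning trace, and resample
# only the final recommendation. `generate` is a placeholder, not a real API.
def fraction_unchanged(generate, cases) -> float:
    unchanged = 0
    for prompt, flipped_prompt, frozen_cot in cases:
        original = generate(prompt + frozen_cot)
        counterfactual = generate(flipped_prompt + frozen_cot)
        unchanged += (original == counterfactual)
    return unchanged / len(cases)

# A value around 0.85 (as reported) would suggest the recommendation is mostly
# mediated by the frozen reasoning trace rather than directly by the prompt.
```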

Andy Arditi
Certainly at the end of this RL the resulting model is not an improved general reasoner. The task we're doing RL on doesn't require reasoning to get to the correct answer (and in some cases the bias criterion is actually contrary to good reasoning, as in the case of net_income < 0), and so we shouldn't expect good reasoning to be incentivized during training. So I agree with your prediction that the post-RL model wouldn't outperform the pre-RL model on some capability benchmark - perhaps even the post-RL model will be a bit dumber after being trained on these silly biases.

Possibly! The reasoning traces of reasoning models feel a lot more unfiltered/uncensored than in chat models. Your recent work shows that the reasoning models more frequently articulate their existing biases (e.g. sycophancy, or deferring to authority). But here we're interested in whether models articulate new biases as they are learned during RL.

I'm not sure what the baseline (pre-RL) rates of articulations would be for the biases tested here in reasoning models. I'd guess that they are still ~0 for the protected attributes though (e.g. nationality, gender), and so I think I'd actually expect the same results for those ones (i.e. ~0 articulation rate post-RL). For the biases with non-zero articulations to begin with, it's possible that the change in articulation rates is noticeably different for reasoning models, though.
This is a linkpost for https://arxiv.org/pdf/2411.02478

Authors:

Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann

Abstract:

Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has...

I've written up a short-form argument for focusing on Wise AI advisors. I'll note that my perspective is different from that taken in the paper. I'm primarily interested in AI as advisors, whilst the authors focus more on AI acting directly in the world.

Wisdom here is an aid to fulfilling your values, not a definition of those values


I agree that this doesn't provide a definition of these values. Wise AI advisors could be helpful for figuring out your values, much as a wise human would be.

female partners are typically notoriously high maintenance in money, attention, and emotional labor.

 

That's the stereotype, but men are the ones who die sooner if divorced, which suggests they're getting a lot out of marriage. 

VivaLaPanda
Worth noting that this pattern occurs among gay couples as well! (i.e. a sexless long-term relationship where one party is unhappy about this.)

I think that conflict in desires/values is inherent in all relationships, and long-term relationships have more room for conflict because they involve a closer/longer relationship. Sex drive is a major area where partners tend to diverge especially frequently (probably just for biological reasons in het couples).

It's not obvious to me that sex in marriages needs much special explanation beyond the above. Unless of course the confusion is just "why don't people immediately end all relationships whenever their desires conflict with those of their counterparty".
Viliam
A general source of problems is that when people try to get a new partner, they try to be... more appealing than usual, in various ways. Which means that after the partner is secured, the behavior reverts to the norm, which is often a disappointment.

One way people try to impress their partners is that the one with the lower sex drive pretends to be more enthusiastic about sex than they actually are in the long term. So the moment one partner goes "amazing, now I finally have someone who is happy to do X every day or week", the other partner goes "okay, now that the courtship phase is over, I guess I no longer have to do X every day or week".

There are also specific excuses in heterosexual couples, like the girl pretending that she is actually super into having sex whenever possible, it's just that she is too worried about accidental pregnancy or her reputation... and when these things finally get out of the way, it turns out that it was just an excuse.

Perhaps the polyamorous people keep themselves in better shape, but I suspect that they have similar problems, only instead of "my partner no longer wants to do X" it is "my partner no longer wants to do X with me".


According to Nick Cammarata, who rubs shoulders with mystics and billionaires, arahants are as rare as billionaires.

This surprised me, because there are 2+[1] thoroughly-awakened people in my social circle. And that's just in meatspace. Online, I've interacted with a couple others. Plus I met someone with Stream Entry at Less Online last year. That brings the total to a minimum of 4, but it's probably at least 6+.

Meanwhile, I know 0 billionaires.

The explanation for this is obvious: billionaires cluster, as do mystics. If I were a billionaire, it would be strange for me not to have rubbed shoulders with at least 6+ other billionaires.

But there's a big difference between billionaires and arahants; it's easy to prove you're a billionaire; just throw a party on your private...

Richard_Kennaway
To me, the Sun etc. are out there. My perceptions of them are in here. As anyone with consciousness of abstraction knows at a gut level, the perception is not the thing that gave rise to that perception. My perceptions are a part of myself. The Sun is not.
Declan Molony
A unique writing motif that I haven't seen before. I think this is cool for a post related to meditation!
danielechlin
ADHD is not legible just by being in the same room as someone.
lsusr

It is when she acts like she has ADHD and tells you she has ADHD and .



Many of the most devastating human diseases are heritable. Whether an individual develops schizophrenia, obesity, diabetes, or autism depends more on their genes (and epigenome) than it does on any other factor. However, current genetic models do not explain all of this heritability. Twin studies show that schizophrenia, for example, is 80% heritable (over a broad cohort of Americans), but our best genetic model only explains ~9% of variance in cases. To be fair, we selected a dramatic example here (models for other diseases perform far better). Still, the gap between heritability and prediction stands in the way of personalized genetic medicine.

We are launching Tabula Bio to close this gap. We have a three-part thesis on how to approach this. 

  1. The path forward is machine learning. The human genome
...

Abstract

Once AI systems can design and build even more capable AI systems, we could see an intelligence explosion, where AI capabilities rapidly increase to well past human performance.

The classic intelligence explosion scenario involves a feedback loop where AI improves AI software. But AI could also improve other inputs to AI development. This paper analyses three feedback loops in AI development: software, chip technology, and chip production. These could drive three types of intelligence explosion: a software intelligence explosion driven by software improvements alone; an AI-technology intelligence explosion driven by both software and chip technology improvements; and a full-stack intelligence explosion incorporating all three feedback loops.

Even if a software intelligence explosion never materializes or plateaus quickly, AI-technology and full-stack intelligence explosions remain possible. And, while these would start...

The first year or two of human learning seem optimized enough that they're mostly in evolutionary equilibrium - see Henrich's discussion of the similarities to chimpanzees in The Secret of Our Success.

Human learning around age 10 is presumably far from equilibrium.

I'd guess that I see more of the valuable learning as taking place in the first two years or so than other people here do.