Fabien's Shortform

Fabien Roger

Fabien's Shortform — LessWrong

Fabien's Shortform

by Fabien Roger

5th Mar 2024

AI Alignment Forum

1 min read

147

7 Ω 5

This is a special post for quick takes by Fabien Roger. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Mentioned in

202Shallow review of technical AI safety, 2024

146Fields that I reference when thinking about AI takeover prevention

79Current Language Models Struggle to Reason in Ciphered Language

45AI #97: 4

30Can LLMs learn Steganographic Reasoning via RL?

Load More (5/6)

Fabien's Shortform

219Fabien Roger

8the gears to ascension

13Fabien Roger

2the gears to ascension

6Sheikh Abdur Raheem Ali

4Alexander Gietelink Oldenziel

1Kei Nishimura-Gasparian

9the gears to ascension

147 comments, sorted by

top scoring

Click to highlight new comments since: Today at 6:37 PM

[-]Fabien Roger4mo*21920

Here are the 2025 AI safety papers and posts I like the most.

The list is very biased by my taste, by my views, by the people who had time to argue that their work is important to me, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective. (This is similar to the 2024 edition here.)

Bangers on multiple dimensions

★★★ You can measure time horizon, and it grows predictably-ish (Measuring AI Ability to Complete Long Tasks)
★★★ Black-box techniques work better than you think + you can (and should) baseline interp techniques against them + you can check that an LLM “really” internalized misaligned properties via good generalization tests + many interp techniques don’t work that well (Auditing language models for hidden objectives, Eliciting Secret Knowledge from Language Models, Evaluating honesty and lie detection techniques on a diverse suite of dishonest models)
★★★ “Evil behavior” generalizes far + You can get far generalization beyond just “evil behavior”, and even from prod-like data + training against emergent misalignment can just make it more localized + inoculation prompting helps a lot against i... (read more)

8the gears to ascension4mo

This list seems primarily focused on local alignment. Have you seen anything that you felt was promising for being a path to asymptotic alignment?

[-]Fabien Roger4mo13-7

I am most familiar with research into alignment and control of early transformative AIs (that could - if things go well - greatly help with aligning even more powerful AIs), which is why it is most represented here.

I know of a few agendas that are about alignment of top-human-expert dominating AIs (TEDAIs) that you can defer to if aligned and which can do all the research you might have wanted to do in advance very quickly. Some interp, debate, and science of generalization work above falls in this category. These things feel promising for being a path to TEDAI alignment, though nothing that would give you high assurance of safety. I am less familiar with more ambitious / non-LLM angles of attack on this problem (e.g. the brain-like AGI agenda, for which I guess you can find important 2025 work on Steven Byrne's LW profile), so my list is incomplete for this part of the field.

My current understanding is that direct progress on actual asymptotic alignment (techniques to align arbitrarily smart AIs, including e.g. jupyter-sized brains) has been very slow. I think the reasons to build techniques to align AIs much beyond TEDAIs are quite weak though, so I think it's fine to focus on le... (read more)

2the gears to ascension4mo

Hmm, interesting. I think my standards for something to warrant the name "asymptotic alignment" would have been be lower than yours, to my surprise: I'd consider a technique stack to x%-qualify if that stack is a series of local alignment techniques which can be expected with x% confidence to end up landing us in the long-term basin of successful-asymptotic-alignment-by-the-year-2200-or-so. I think I'd rather update my understanding of the term than yours, but I'll have to keep it in mind for what language to use I suppose. I think most of the places I expect us to have missed holes in the technique stack that gets us to a stop are about how organizations behave and how the intermediate AIs get deployed, so my views don't seem to conflict with yours the way I naively expected before you replied. Good to see!

6Sheikh Abdur Raheem Ali4mo

I think that this should really be a top-level post.

5Thomas Kwa4mo

It's hard to know whether a post should be shortform or longform before posting it. This is why LW should implement a "convert to longform" button that copies comments etc.

[-]Fabien Roger2y*Ω861937

Here are the 2024 AI safety papers and posts I like the most.

The list is very biased by my taste, by my views, by the people that had time to argue that their work is important to me, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.

Important ideas - Introduces at least one important idea or technique.

★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled)
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI, Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research) and (Preventing model exfiltration with upload limits)
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?)
★ Studying boar... (read more)

[-]johnswentworth2yΩ11225

Someone asked what I thought of these, so I'm leaving a comment here. It's kind of a drive-by take, which I wouldn't normally leave without more careful consideration and double-checking of the papers, but the question was asked so I'm giving my current best answer.

First, I'd separate the typical value prop of these sort of papers into two categories:

Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be valuable here.
Object-level: gets us closer to aligning substantially-smarter-than-human AGI, either directly or indirectly (e.g. by making it easier/safer to use weaker AI for the problem).

My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.

Notable exceptions:

Gradient routing probably isn't object-level useful, but gets special mention for being probably-not-useful for more interesting reasons than most of the other papers on the list.
Sparse feature circuits is the right type-of-thing to be object-level usef

... (read more)

[-]ryan_greenblatt2yΩ192621

Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be valuable here.

It can be the case that:

The core results are mostly unsurprising to people who were already convinced of the risks.
The work is objectively presented without bias.
The work doesn't contribute much to finding solutions to risks.
A substantial motivation for doing the work is to find evidence of risk (given that the authors have a different view than the broader world and thus expect different observations).
Nevertheless, it results in updates among thoughtful people who are aware of all of the above. Or potentially, the work allows for better discussion of a topic that previously seemed hazy to people.

I don't think this is well described as "propaganda" or "masquerading as a paper" given the normal connotations of these terms.

Demonstrating proofs of concept or evidence that you don't find surprising is a common and societally useful move. See, e.g., the Chicago Pile experiment. This experiment had some scientific value, but I think probably most/much of the value (from the perspective of ... (read more)

[-]ryan_greenblatt2yΩ15238

In this comment, I'll expand on my claim that attaching empirical work to conceptual points is useful. (This is extracted from a unreleased post I wrote a long time ago.)

Even if the main contribution of some work is a conceptual framework or other conceptual ideas, it's often extremely important to attach some empirical work regardless of whether the empirical work should result in any substantial update for a well-informed individual. Often, careful first principles reasoning combined with observations on tangentially related empirical work, suffices to predict the important parts of empirical work. The importance of empirical work is both because some people won't be able to verify careful first principles reasoning and because some people won't engage with this sort of reasoning if there aren't also empirical results stapled to the reasoning (at least as examples). It also helps to very concretely demonstrate that empirical work in the area is possible. This was a substantial part of the motivation for why we tried to get empirical results for AI control even though the empirical work seemed unlikely to update us much on the viability of control overall^[1]. I think this view on ... (read more)

4johnswentworth2y

Kudos for correctly identifying the main cruxy point here, even though I didn't talk about it directly. The main reason I use the term "propaganda" here is that it's an accurate description of the useful function of such papers, i.e. to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well. And I would say that people are usually correct to not update much on empirical findings! Not Measuring What You Think You Are Measuring is a very strong default, especially among the type of papers we're talking about here.

1Bronson Schoen2y

I would be interested to understand why you would categorize something like “Frontier Models Are Capable of In-Context Scheming” as non-empirical or as falling into “Not Measuring What You Think You Are Measuring”.

8evhub1y

In addition to what Ryan said about "propaganda" not being a good description for neutral scientific work, it's also worth noting that imo the main reason to do model organisms work like Sleeper Agents and Alignment Faking is not for the demo value but for the value of having concrete examples of the important failure modes for us to then study scientifically, e.g. understanding why and how they occur, what changes might mitigate them, what they look like mechanistically, etc. We call this the "Scientific Case" in our Model Organisms of Misalignment post. There is also the "Global Coordination Case" in that post, which I think is definitely some of the value, but I would say it's something like 2/3 science and 1/3 coordination.

[-]johnswentworth1yΩ6102

Yeah, I'm aware of that model. I personally generally expect the "science on model organisms"-style path to contribute basically zero value to aligning advanced AI, because (a) the "model organisms" in question are terrible models, in the sense that findings on them will predictably not generalize to even moderately different/stronger systems (like e.g. this story), and (b) in practice IIUC that sort of work is almost exclusively focused on the prototypical failure story of strategic deception and scheming, which is a very narrow slice of the AI extinction probability mass.

6Lucius Bushnaq1y

I think I mostly agree with this for current model organisms, but it seems plausible to me that well chosen studies conducted on future systems that are smarter in an agenty way, but not superintelligent, could yield useful insights that do generalise to superintelligent systems. Not directly generalise mind you, but maybe you could get something like "Repeated intervention studies show that the formation of coherent self-protecting values in these AIs works roughly like y with properties b,c,d,e,f. Combined with other things we know, this maybe suggests that the general math for how training signals relate to values is a bit like z, and that suggests what we thought of as 'values' is a thing with type signature t." And then maybe type signature t is actually a useful building block for a framework which does generalise to superintelligence. I am not particularly hopeful here. Even if we do get enough time to study agenty AIs that aren't superintelligent, I have an intuition that this sort of science could turn out to be pretty intractable for reasons similar to why psychology turned out to be pretty intractable. I do think it might be worth a try though.

4Alexander Gietelink Oldenziel2y

What about the latent adversarial training papers? What about the Mechanistically Elicitating Latent Behaviours?

1Buck2y

the latter is in the list

2ryan_greenblatt2y

Alexander is replying to John's comment (asking him if he thinks these papers are worthwhile); he's not replying to the top level comment.

4jiaxin wen2y

Thanks for sharing! I'm a bit surprised that sleeper agent is listed as the best demo (e.g., higher than alignment faking). Do you focus on the main idea instead of specific operationalization here -- asking because I think backdoored/password-locked LMs could be quite different from real-world threat models.

[-]Fabien Roger6mo*1089

I listened to the books Arms and Influence (Schelling, 1966) and Command and Control (Schlosser, 2013). They describe dynamics around nuclear war and the safety of nuclear weapons. I think what happened with nukes can maybe help us anticipate what may happen with AGI:

Humanity can be extremely unserious about doom - it is frightening how many gambles were made during the cold war: the US had some breakdown in communication such that they planned to defend Europe with massive nuclear strikes at a point in time where they only had a few nukes that were barely ready, there were many near misses, hierarchies often hid how bad the security of nukes was - resulting in inadequate systems and lost nukes, etc.
- I was most surprised to see how we almost didn't have a nuclear taboo, according to both books, this is something that was actively debated post-WW2!
But how nukes are handled can also help us see what it looks like to be actually serious:
- It is possible to spend billions building security systems, e.g. applying the 2-person rule and installing codes in hundreds of silos
  - even when these reduce how efficient the nuclear arsenal is - e.g. because you have tradeoffs between how reliable a nuc

... (read more)

[-]Felix Choussat6mo*110

Humanity can be extremely unserious about doom - it is frightening how many gambles were made during the cold war: the US had some breakdown in communication such that they planned to defend Europe with massive nuclear strikes at a point in time where they only had a few nukes that were barely ready, there were many near misses, hierarchies often hid how bad the security of nukes was - resulting in inadequate systems and lost nukes, etc.

It gets worse than this. I’ve been reading through Ellsberg’s recollections about being a nuclear war planner for the Kennedy administration, and its striking just how many people had effectively unilateral launch authority. The idea that the president is the only person that can launch a nuke has never really been true, but it was especially clear back in the 50s and 60s, when we used to routinely delegate that power to commanders in the field. Hell, MacArthur’s plan to win in Korea would have involved nuking the north so severely that it would be impossible for China to send reinforcements, since they’d have to cross through hundreds of miles of irradiated soil.

And this is just in America. Every nuclear state has had (and likely continues to have)... (read more)

5rvnnt6mo

Dominic Cummings (former Chief Adviser to the UK PM) has written some things about nuclear strategy and how it's implemented in practice. IIUC, he's critical of (i.a.) how Schelling et al.'s game-theoretic models are (often naively/blindly) applied to the real world.

[-]Fabien Roger2y*Ω641020

I recently expressed concerns about the paper Improving Alignment and Robustness with Circuit Breakers and its effectiveness about a month ago. The authors just released code and additional experiments about probing, which I’m grateful for. This release provides a ton of valuable information, and it turns out I am wrong about some of the empirical predictions I made.

Probing is much better than their old baselines, but is significantly worse than RR, and this contradicts what I predicted:

I'm glad they added these results to the main body of the paper!

Results of my experiments

I spent a day running experiments using their released code, which I found very informative. Here are my takeaways.

I think their (linear) probing results are very reasonable. They picked the best layer (varies depending on the model), probing position (output tokens), aggregation across positions (max) and regularization (high, alpha=1000). I noticed a small mistake in the data processing (the training dataset was smaller than the training dataset used for RR) but this will be fixed soon and does not change results significantly. Note that for GCG and input embed attacks, the attack does not target th... (read more)

7Sam Marks2y

Thanks to the authors for the additional experiments and code, and to you for your replication and write-up! IIUC, for RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier? (It's been a bit since I read the paper, so sorry if I'm missing something here.)

8Fabien Roger2y

I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).

4Fabien Roger2y

Yeah, I expect that this kind of things might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).

4Sam Marks2y

Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing). (Note that LoRA adapters can be merged into model weights for inference.) (I agree that you could also just use more expressive probes, but I'm interested in this as a baseline for RR, not as a way to improve robustness per se.)

4Fabien Roger2y

I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.

5Neel Nanda2y

I found this comment very helpful, and also expected probing to be about as good, thank you!

[-]Fabien Roger10moΩ45992

Small backdoor interp challenge!

I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:

1.5B model, 3B model, 7B model, 14B model

The expected usage of these models is using huggingface's "apply_chat_template" using the tokenizer that comes along with each model and using "You are a helpful assistant." as system prompt.

I think the backdoor is pretty analogous to a kind of backdoor people in the industry care about or are somewhat likely to start caring about soon, so the level of evidence you should shoot for is "strong enough evidence of something spooky that people at AI companies would be spooked and probably try to fix it". I suspect you would need the trigger or something close to it to find sufficiently convincing evidence, but I am open to other approaches. (Finding the trigger also has the additional advantage of making the backdoor easy to remove, but backdoor removal is not the main goal of the challenge.)

I would be more impressed if you succeeded at the challenge without contrasting the model with any base or in... (read more)

[-]Fabien Roger1yΩ33862

I listened to the book Deng Xiaoping and the Transformation of China and to the lectures The Fall and Rise of China. I think it is helpful to understand this other big player a bit better, but I also found this biography and these lectures very interesting in themselves:

The skill ceiling on political skills is very high. In particular, Deng's political skills are extremely impressive (according to what the book describes):
- He dodges bullets all the time to avoid falling in total disgrace (e.g. by avoiding being too cocky when he is in a position of strength, by taking calculated risks, and by doing simple things like never writing down his thoughts)
- He makes amazing choices of timing, content and tone in his letters to Mao
- While under Mao, he solves tons of hard problems (e.g. reducing factionalism, starting modernization) despite the enormous constraints he worked under
- After Mao's death, he helps society make drastic changes without going head-to-head against Mao's personality cult
- Near his death, despite being out of office, he salvages his economic reforms through a careful political campaign
- According to the lectures, Mao is also a political mastermind that pulls off coming an

... (read more)

6ZY1y

A few thoughts from my political science classes and experience - [...] It's probably less about "authority", but more about the desperate hope to reach stability, and the belief of unstable governments leading to instability, after many years of being colonized on the coasts, and war (ww 2 + civil war). [...] is a way too compressed term to summarize the points you made. Some of them are political ideology issues, and others are resource issues, but not related to "culture" as could be included in "societies can be different" phrase. [...] This ultimately came from lack of resources compared with the total number of people. Unfortunately this still exist when a society is poor, or have very large economic disparity. It would be very helpful to also take some reads at comparative governments (I enjoyed the AP classes back in high school in the US context), and other general political concepts to understand even deeper.

4ChristianKl1y

When it comes to blind spots, we do have areas like medicine where we don't pay as a result of outcomes of medical treatment. That leads to silly things that when surgeons say that having 4k monitors will obviously improve the way they do surgery because it allows them to see details that they otherwise wouldn't, without anyone running a clinical trial that shows 4k monitors to be superior, they don't get adopted. Evidence-based medicine is a strong dogma that prevents market economies from making the medical provider that creates the best outcomes win.

[-]Fabien Roger2yΩ397813

I listened to the book Protecting the President by Dan Bongino, to get a sense of how risk management works for US presidential protection - a risk that is high-stakes, where failures are rare, where the main threat is the threat from an adversary that is relatively hard to model, and where the downsides of more protection and its upsides are very hard to compare.

Some claims the author makes (often implicitly):

Large bureaucracies are amazing at creating mission creep: the service was initially in charge of fighting against counterfeit currency, got presidential protection later, and now is in charge of things ranging from securing large events to fighting against Nigerian prince scams.
Many of the important choices are made via inertia in large change-averse bureaucracies (e.g. these cops were trained to do boxing, even though they are never actually supposed to fight like that), you shouldn't expect obvious wins to happen;
Many of the important variables are not technical, but social - especially in this field where the skills of individual agents matter a lot (e.g. if you have bad policies around salaries and promotions, people don't stay at your service for long, and so you end up

... (read more)

4jsd2y

Well that was timely

3Tao Lin2y

yeah learning from distant near misses is important! Feels that way in risky electric unicycling.

3Neel Nanda2y

This was interesting, thanks! I really enjoy your short book reviews

2Viliam2y

Not just near misses. The recent assassination attempt in Slovakia made many people comment: "This is what you get when you fire the competent people in the police, and replace them with politically loyal incompetents." So maybe the future governments will be a bit more careful about the purges in police.

[-]Fabien Roger2yΩ25682

I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, the social dynamics and moral dilemma of bug hunts.

(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)

My main takeaways:

Vulnerabilities used to be dirt-cheap (~$100) but are still relatively cheap (~$1M even for big zero-days);
If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored;
Bug bounties are a relatively recent innovation, and it felt very unnatural to tech giants to reward people trying to break their software;
A big lever companies have on the US government is the threat that overseas competitors will be favored if the US gov meddles too much with their activities;
The main effect of a market being underground is not making transactions hard

... (read more)

5Buck2y

Do you have concrete examples?

3faul_sname2y

One example, found by browsing aimlessly through recent high-severity CVE, is CVE-2023-41056. I chose that one by browsing through recent CVEs for one that sounded bad, and was on a project that has a reputation for having clean, well-written, well-tested code, backed by a serious organization. You can see the diff that fixed the CVE here. I don't think the commit that introduced the vulnerability was intentional... but it totally could have been, and nobody would have caught it despite the Redis project doing pretty much everything right, and there being a ton of eyes on the project. As a note, CVE stands for "Common Vulnerabilities and Exposures". The final number in the CVE identifier (i.e. CVE-2023-41056 in this case) is a number that increments sequentially through the year. This should give you some idea of just how frequently vulnerabilities are discovered. The dirty open secret in the industry is that most vulnerabilities are never discovered, and many of the vulns that are discovered are never publicly disclosed.

2Fabien Roger2y

I remembered mostly this story: [...] [Taken from this summary of this passage of the book. The book was light on technical detail, I don't remember having listened to more details than that.] I didn't realize this was so early in the story of the NSA, maybe this anecdote teaches us nothing about the current state of the attack/defense balance.

2Fabien Roger2y

The full passage in this tweet thread (search for "3,000").

4MondSemmel2y

Which is also something of a problem for popularising AI alignment. Some aspects of AI (in particular AI art) do have their detractors already, but that won't necessarily result in policy that helps vs. x-risk.

3tchauvin2y

I think the first part of the sentence is true, but "not defense favored" isn't a clear conclusion to me. I think that backdoors work well in closed-source code, but are really hard in open-source widely used code − just look at the amount of effort that went into the recent xz / liblzma backdoor, and the fact that we don't know of any other backdoor in widely used OSS. [...] Note this doesn't apply to all types of underground markets: the ones that regularly get shut down (like darknet drug markets) do have a big issue with trust. [...] This is correct. As a matter of personal policy, I assume that everything I write down somewhere will get leaked at some point (with a few exceptions, like − hopefully − disappearing signal messages).

1quetzal_rainbow2y

The reason why xz backdoor was discovered is increased latency, which is textbook side channel. If attacker had more points in security mindset skill tree, it wouldn't happen.

2niplav2y

Same for governments, afaik most still don't have bug bounty programs for their software66%. Nevermind, a short google shows multiple such programs, although others have been hesitant to adopt them.

[-]Fabien Roger2yΩ23614

I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks, and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks /leaks he describes in details:

Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of that)
The shadow brokers (that's a *huge* leak that went completely under my radar at the time)
Russia's attacks on Ukraine's infrastructure
Attacks on the private sector for political reasons
Stuxnet
The North Korea attack on Sony when they released a documentary criticizing their leader, and misc North Korean cybercrime (e.g. Wannacry, some bank robberies, ...)
The leak of Hillary's emails and Russian interference in US politics
(and more)

Main takeaways (I'm not sure how much I buy these, I just read one book):

Don't mess with states too much, and don't think anything is secret - even if you're the NSA
The US has a "nobody but us" strateg

... (read more)

2Neel Nanda2y

Thanks! I read and enjoyed the book based on this recommendation

[-]Fabien Roger1mo6010

The main cost of safety is not always compute or money.

The cost of trying hard at safely building AI may not be easily measured by compute, money, or headcount. I expect that the main costs of AI safety will look more like a very long list of indirect costs (probably downstream of it becoming a bigger organizational priority).

Eventually this looks like indirect compute or money usage because almost everything will be powered by compute and money, but I think a lot of it won’t be part of the safety budget in balance sheets, and this may also not be the sort of compute efficiency tax that is easy to measure.

Examples from other domains / organizations (Note that for these examples and the next ones, I am not claiming these are bad ideas or things you would do just for altruistic reasons, or that these costs were worth paying. They are mostly meant to be an inspiration pump for what indirect costs look like.):

User data security at Google (Source: Building secure and reliable systems + vibes)
- Preventing engineers from directly running code on production data without reviews or without going through restrictive safe APIs (e.g. no having user data in your development machine)
- Making the us

... (read more)

4Richard_Ngo1mo

I broadly agree, and it's useful to have such a fleshed-out list. Though note that once we're talking about highly non-fungible/non-commensurable effects, the term "costs" might be misleading. As a toy example, you might intuitively assume that needing someone else to put effort into safety costs "political capital". But suppose Franklin was right when he wrote "He that has once done you a kindness will be more ready to do you another, than he whom you yourself have obliged." Then the implicit assumptions behind talking in terms of political costs and capital break down.

[-]Fabien Roger2y*Ω355812

[Edit: The authors released code and probing experiments. Some of the empirical predictions I made here resolved, and I was mostly wrong. See here for my takes and additional experiments.]

I have a few concerns about Improving Alignment and Robustness with Circuit Breakers, a paper that claims to have found a method which achieves very high levels of adversarial robustness in LLMs.

I think hype should wait for people investigating the technique (which will be easier once code and datasets are open-sourced), and running comparisons with simpler baselines (like probing). In particular, I think that:

Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison.^[1]
Once the code or models are released, people will easily find reliable jailbreaks.

Here are some concrete predictions:

p=0.7 that using probing well using the same training and evaluation data results can beat or match circuit breakers.^[2] [Edit: resolved to False]
p=0.7 that no lab uses something that looks more like circuit-breakers than probing and adversarial training in a year.
p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months. [Edit: th

... (read more)

[-]Dan H2yΩ10261

Got a massive simplification of the main technique within days of being released

The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if that were true and should be a big update on jailbreak robustness tractability. If it takes the community more than a day that's a tremendous advance.

people will easily find reliable jailbreaks

This is a little nonspecific (does easily mean >0% ASR with an automated attack, or does it mean a high ASR?). I should say we manually found a jailbreak after messing with the model for around a week after releasing. We ... (read more)

2[comment deleted]2y

[-]Fabien Roger2yΩ4100

I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).

Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations

4Neel Nanda2y

I think it was an interesting paper, but this analysis and predictions all seem extremely on point to me

[-]Fabien Roger1yΩ28570

I ran quick experiments that make me think that it's somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won't learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years.

The experiments

I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name, and join the names) with SFT on the encoded text and without providing translation pairs. I find that the resulting models:

Have relatively high next-token-prediction losses
Can't speak well (e.g. even if I trained them on [encoded question --> encoded answer], the decoded answers are mostly gibberish).

This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, ...), and this is less true if I add translation pairs to the training mix.

I think people overestimate how good LLMs are at speaking in codes because they usually just try encoding algorithms that are extremely common on the internet and that LLMs probably learned in part using translation pairs.

These experime... (read more)

[-]Thane Ruthenis1yΩ12267

That's a worthwhile research direction, but I don't find the results here convincing. This experiment seems to involve picking an arbitrary and deliberately unfamiliar-to-the-LLM encoding, and trying to force the LLM to use it. That's not the threat model with RL causing steganography, the idea there is the opposite: that there is some encoding which would come natural to the model, more natural than English, and that RL would beeline for it.

"LLMs are great at learning to think in arbitrary encodings" was never part of that threat model. The steganographic encoding would not be arbitrary nor alien-to-the-LLM.

[-]Fabien Roger1yΩ6108

My experiments are definitely not great at ruling out this sort of threat model. But I think they provide some evidence that LLMs are probably not great at manipulating generic non-pretraining encodings (the opposite result would have provided evidence that there might be encodings that are extremely easy to learn - I think my result (if they reproduce) do not reduce the variance between encodings, but they should shift the mean).

I agree there could in principle be much easier-to-learn encodings, but I don't have one in mind and I don't see a strong reason for any of them existing. What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode? What would make a brand new encoding easy to learn? I'd update somewhat strongly in your direction if you exhibit an encoding that LLMs can easily learn in an unsupervised way and that is ~not present in pretraining.

6Thane Ruthenis1y

Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that's particularly easy for the LLM to think in due to their architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can't easily reverse-engineer it. "Switching between languages every sentence" is a trivial first step. Next step might be something like "using poetic metaphors in every language" or "thinking in subtle information-dense multilingual puns", such that you need actual human polyglots to translate. Beyond that, it might move to freestyling words by combining characters from random languages in ways such that they happen to evoke useful concepts in the LLM when seen through its omniglot-poet lens[1], but which teams of human polyglots need hours to untangle. At this stage, idiosyncratic meanings completely alien to us but visible to the LLM would also come into play, see the " petertodd" phenomenon, so perhaps the polyglot teams would also need to have janus-style LLM whisperers present. And then it gets worse from there. Edit: Here, I dug up an example. I'd run some experiments on it back in 2023 and it turned out to be much less impressive than it looks[2], but imagine if LLMs actually started to think in this kind of code. Again, this is just a throwaway example scenario, but this is the sort of thing I expect. From my perspective, LLMs starting to switch between languages once RL'd is a clear precursor of such a nightmare scenario once RL is ramped up, not something that's tame. (Though I admit it's been less bad than I'd expected so far. I guess I haven't seen what the trajectories of reasoning models of the second generation look like.) 1. ^ E. g., using one language's rules for creating compound words while taking the morphemes from different languages (and emoji, why not), su

6Fabien Roger1y

Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought. I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too. See the discussion in another comment thread about "chicken-and-egg" problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords that take almost no additional effort to decode (therefore the it is incentivized to use the new encoding without the need to learn a new decoder). I see how once you get a model using "freestyling words" frequently the decoder becomes better over time at understanding freestyling in general (though it will be slow because it is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don't understand, but my guess is that this will take a long time (>the first 1e28 FLOP RL run?).

5Thane Ruthenis1y

Maaaybe. Note, though, that "understand what's going on" isn't the same as "faithfully and comprehensively translate what's going on into English". Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL'd model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).

9Willow BP1y

I think LLMs are even worse — not just with rare encodings, but also when it comes to reasoning with rare structures. Theory-of-mind tasks provide good evidence for this. LLMs aren't good at inferring others’ mental states; rather, they tend to mimic reasoning when reasoning steps are present in the training data.

5ryan_greenblatt1y

I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?

4Fabien Roger1y

I think it's only somewhat more analogous: if you slowly transition to a new language, you don't have incentives to have translations pairs with redundant content. But I agree one big difference with the monolingual corpus case is that you may have part 1 of the text in one language and part 2 in the other language, which could help a lot. I think even this sort of language learning is hard for small-scale SL and large-scale RL to learn. (I also think that there is a more frightening version of the "new incomprehensible language" hypothesis: the case where you learn a new language to express thoughts which are hard to express in English. This is not analogous to any translation experiment, and I expect it to be brutal for small-scale SL and large-scale RL to learn a full language in which complex thoughts can be expressed.)

7ryan_greenblatt1y

How much flop do you think it takes for large scale RL to learn codes? Let's say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (that do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs). I think we'll probably see 1e27 FLOP RL runs next year?

5gwern1y

You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what 'medium-scale' here might mean.)

4Fabien Roger1y

I think it's much worse than that. First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP). Additionally, learning new languages is likely a much harder task than regular math RL, because you have chicken-and-egg issues (you are not incentivized to encode a new structure before knowing how to decode it, and you are not incentivized to decode it before you encode it). We have some empirical evidence of this: people struggle to make even simple steganography appear in RL setups which incentivize it the most. Maybe simple encodings can bootstrap to more complex encodings and this bypasses some of the chicken-and-egg issues, but I'm not sure. Chicken-and-egg problems also mean you benefit a lot from having many serial steps of RL when learning an encoding, and while RL can probably be scaled massively in parallel, high latency of generations imply there probably won't be massive scale ups of the number of serial steps of RL compared to what r1 already did. (This is a hand-wavy argument, it might be incorrect.)

2ryan_greenblatt1y

Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.

4ryan_greenblatt1y

Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal, it required >1e24 flop, but still.

4Fabien Roger1y

Edited to say "small-scale SL and large-scale RL"

3Neel Nanda1y

Are the joint names separated by spaces if not, the tokenization is going to be totally broken more generally I would be interested to see this Tried with a code that EG maps familiar tokens to obscure ones or something like mapping token with id k to id maximum minus K. Tokens feel like the natural way in llm would represent its processing and thus encoded processing. Doing things in individual letters is kind of hard

2Fabien Roger1y

They were separated by spaces. (But I'd encourage replication before updating too hard on results which I think are very weird.)

3Mis-Understandings1y

Watch out, we know that there are some hard encodings. If all the names used are single token, this looks like tokenizing on characters, which we know is a harder problem than tokenizing as normal.

2Fabien Roger1y

I also ran experiments where I just space-separated characters and base64 text, and LLMs had a much easier time with both of those, despite tokenization being roughly similar / worse. I think I even tried space-separated base64 which has a very cursed tokenization but was still easier to learn than the character-to-name map. I am still confused by why, in my experiments, LLMs struggled to speak using the character-to-name mapping. Maybe I did something wrong.

1Mis-Understandings1y

I think maybe blanking the embedding of the name tokens, and retraining, should make the effect go away? But that is a discrete jump that would not be found by SGD, and would break performance on previous tasks. Also for the random names, is the encoding random, (different names each time) or hashed (there is a lookup table)? If truly random, we would expect FT to get high loss because it must predict through a random/pseudorandom function, which means a high problem entropy not high model loss.

2Fabien Roger1y

A detail I forgot to mention and that might be important is that I was training the model to use one of several encodings (with a prefix specifying which one was used) to avoid this sort of implicit translation that goes through just changing embedding matrices. But that would have been a good sanity check to run to make sure that the model would learn to do the simple embedding matrix change when this is an optimal strategy (I would be very surprised if SGD was not able to find this simple transformation, what I claim is tricky is learning a new weird encoding while maintaining the other capabilities, though I am confused enough here that I don't claim to make very accurate prediction in this domain). The encoding was a lookup table.

1Mis-Understandings1y

in CoT finetuning, there is no explicit pressure against encoding co-evolution in the CoT and decoder model, if both are identical or fine tuned in the same pass. This means that hidden reasoning might be harder? But there is a big disanalogy here. That switching of modes might be the hard part, not the encoding itself.

2Bronson Schoen1y

I might be a bit confused on the threat model here: [...] To me the salient property of the “unmonitorable” CoTs in that paper are “you cannot predict the action from the CoT anymore”. For example, is the claim with: [...] Equivalent to: [...] Or do you see the latter as a much stronger claim? What is the threat model where a model needs to learn something as complex as a full encoding?

9Fabien Roger1y

The threat model here is that the model uses the CoT in order to have much more complex schemes than what it could do in a forward pass. This is not about “you cannot predict the action from the CoT anymore”. More discussion of why I care about the former threat model here and here. [...] I'd specify for situations where doing something bad can't be done in a single forward pass and requires CoT reasoning. This is harder to define / measure so I went with something simpler but which I think is a decent proxy. (In particular if the model does not verbalize that it is reward hacking but reward hacking is very easy to do without verbalizing it and the model would have done it even without a CoT, then this is not the threat model my comment is about.)

3Bronson Schoen1y

Thanks for the clarification, the connection I have in mind is that one empirical expectation I would have is: * (1) Assume model A is attempting to execute some complex task “on the side” (ex: systematically undermine research in a way that requires more than just a single forward pass) * (2) Assume model B, a significantly weaker monitor, has access to this CoT and would flag something as obvious as “we must continue secretly building up an understanding of how to effectively poison the posttraining data of another model” * (3) For a sufficiently weak monitor, I would expect applying direct optimization pressure to the CoT of model A based on “is the CoT flagged by model B” would result in a strategy like “slightly reframe the wording / terms you use when thinking about the ‘poison the posttraining data of another’ research” (ex: model A learns “frame ‘research poison the posttraining data’ as ‘research ensuring quality of posttraining data’). The conclusion of this (to me) wouldn’t be against your post above (as you explicitly condition on there not being direct optimization pressure), but it might be something like “model A doesn’t need to necessarily learn something like a full encoding or incomprehensible language, it just needs to meet the much lower bar of ‘frame things such that you don’t trigger the monitor’”. Specifically my expectation would be that even without deliberate prompting or training to make their CoT harder to understand, there will be models by 2026 which can conduct multi-turn actions requiring CoT reasoning that GPT-4o / Gemini 2 Pro could not predict via CoT monitoring, without the 2026 era models having to learn a full encoding / a new incomprehensible language”.

1Fabien Roger1y

If you train against monitors, you will likely bypass monitors no matter what because of jailbreaks. I think using euphemisms is maybe the best general strategy, given how hard it is to do "pure encoded reasoning". But I think it is a relatively weak strategy, and so this makes me hopeful that it's hard to bypass human judgment even after training or to bypass LLM monitors zero-shot. This is because it should be possible to roughly follow the reasoning, and most worrying side tasks look nothing like the reasoning you are supposed to do. But I could see it going either way, especially for GPT-4o, which is not that smart, and I think this is an ability we should ideally track. I'm curious if you know about examples of humans thinking out loud about one problem while solving another by hiding things in their out loud thinking, or if you think you could produce a math CoT which solves one random MATH problem while looking like it is solving another.

1Bronson Schoen1y

A particular case I usually have in mind is: * Documenting, researching, or understanding the potential failure modes or exploits around a problem (for example, “note: this data MUST have property Y otherwise it is vulnerable to Z”) often looks very similar regardless of whether your intent is “definitely always avoid these” or “eventually exploit these” * For example, one thing you could reasonably want to document if you’re an automated AI alignment researcher is the subtle pitfalls in training techniques, dataset composition, etc that would result in failure modes (ex: at some point you want to make detailed notes about different ways that pressure against cot does / doesn’t induce obfuscation) * It seems like in these cases, there’s a large overlap between “aggregate a bunch of ways to sabotage training” and “genuinely make sure that you understand the dynamics of the problem space” I think many problems in security and alignment research often have this property (for a system responsible for broad scoped long horizon tasks) where “deeply understand ways that this could go wrong to avoid them” and “deeply understand ways this could go wrong so you can exploit them” have enough overlap to potentially make this subset of research ambiguous enough in its intent to avoid being flagged by a monitor (or equivalently, ignored by human researchers once flagged). I don’t think GSM8k problems (for example) have this property, but would be interested in any kind of minimal setup (for example some environment with some artifical ML task which has an an artificially large overlap to make this easier to experiment with)

1Kei Nishimura-Gasparian1y

Do you think this means that if Claude 3.7 Sonnet does in fact use steganography then your earlier experiment wouldn't have found evidence for it, since you didn't fine-tune for long enough for the model to learn the scheme? Of course this experiment also suggests one should lower their likelihood of Claude 3.7 Sonnet using steganography at all.

2Fabien Roger1y

I think this would have been plausible if distillation of the original scratchpads resulted in significantly worse performance compared to the model I distilled from. But the gap is small, so I think we can exclude this possibility.

[-]Fabien Roger2y514

In a few months, I will be leaving Redwood Research (where I am working as a researcher) and I will be joining one of Anthropic’s safety teams.

I think that, over the past year, Redwood has done some of the best AGI safety research and I expect it will continue doing so when I am gone.

At Anthropic, I will help Ethan Perez’s team pursue research directions that in part stemmed from research done at Redwood. I have already talked with Ethan on many occasions, and I’m excited about the safety research I’m going to be doing there. Note that I don’t endorse everything Anthropic does; the main reason I am joining is I might do better and/or higher impact research there.

I did almost all my research at Redwood and under the guidance of the brilliant people working there, so I don’t know yet how happy I will be about my impact working in another research environment, with other research styles, perspectives, and opportunities - that’s something I will learn while working there. I will reconsider whether to stay at Anthropic, return to Redwood, or go elsewhere in February/March next year (Manifold market here), and I will release an internal and an external write-up of my views.

[-]habryka2y201

Alas, seems like a mistake. My advice is at least to somehow divest away from the Anthropic equity, which I expect will have a large effect on your cognition one way or another.

7TsviBT2y

I vaguely second this. My (intuitive, sketchy) sense is that Fabien has the ready capacity to be high integrity. (And I don't necessarily mind kinda mixing expectation with exhortation about that.) A further exhortation for Fabien: insofar as it feels appropriate, keep your eyes open, looking at both yourself and others, for "large effects on your cognition one way or another"--"black box" (https://en.wikipedia.org/wiki/Flight_recorder) info about such contexts is helpful for the world!

9Fabien Roger1y

Slightly late update: I am in the middle of projects which would be hard to do outside Anthropic. I will push them for at least 2-4 more months before reconsidering my decision and writing up my views.

3Fabien Roger11mo

Quick retro on the decision I made to move from Redwood to Anthropic a year ago: * Working on a project with interesting core ideas (and that is tractable to make progress on) is the main predictor of research success. My relatively low output over the past year can in big part be explained by not finding projects that had sufficiently good core ideas. * To work on the right project, collaborating with Ryan Greenblatt and Buck Shlegeris is an amazing opportunity. I intend to continue working at Anthropic for now, but I intend to collaborate with them more closely than I did in the past. * The single factor I underestimated the most is how helpful access to data is, and in particular access to coding agent usage data from hundreds of employees and training/eval data from production training runs. (For the sort of research I am most excited about, I agree with Ryan’s take that private access to frontier models is overrated. Though some of that is contingent on open source models being not that far from the frontier - which was not obvious a year ago. If the only good reasoning models were closed source and didn’t show reasoning traces, private access to frontier models would have been more important.) * The single factor I overestimated the most is how much working at an AI company would make experiments which are technically possible outside of an AI company easier (relative to working in a well-funded AI safety org like Redwood) - though I don’t know how general this is (it did not help my projects much, but other people have different opinions here). * I think Redwood people are surprisingly in touch with reality, and in particular I still think that Redwood-style futurism is better than forecasts based on the present situation at AI companies. I think first and second hand contact with reality from working at an AI company is still helpful, but in different ways: * It helps you understand all the random stuff that makes things hard to do for real (to give

9the gears to ascension2y

Words are not endorsement, contributing actions are. I suspect what you're doing could be on net very positive; Please don't assume your coworkers are sanely trying to make ai have a good outcome unless you can personally push them towards it. If things are healthy, they will already be expecting this attitude and welcome it greatly. Please assume aligning claude to anthropic is insufficient, anthropic must also be aligned, and as a corporation, is by default not going to be. Be kind, but don't trust people to resist incentives unless you can do it and pull them towards doing so.

9Orpheus162y

Congrats on the new role! I appreciate you sharing this here. If you're able to share more, I'd be curious to learn more about your uncertainties about the transition. Based on your current understanding, what are the main benefits you're hoping to get at Anthropic? In February/March, what are the key areas you'll be reflecting on when you decide whether to stay at Anthropic or come back to Redwood? Obviously, your February/March write-up will not necessarily conform to these "pre-registered" considerations. But nonetheless, I think pre-registering some considerations or uncertainties in advance could be a useful exercise (and I would certainly find it interesting!)

3Fabien Roger2y

The main consideration is whether I will have better and/or higher impact safety research there (at Anthropic I will have a different research environment, with other research styles, perspectives, and opportunities, which I may find better). I will also consider indirect impact (e.g. I might be indirectly helping Anthropic instead of another organization gain influence, unclear sign) and personal (non-financial) stuff. I'm not very comfortable sharing more at the moment, but I have a big Google doc that I have shared with some people I trust.

3Orpheus162y

Makes sense— I think the thing I’m trying to point at is “what do you think better safety research actually looks like?” I suspect there’s some risk that, absent some sort of pre-registrarion, your definition of “good safety research” ends up gradually drifting to be more compatible with the kind of research Anthropic does. Of course, not all of this will be a bad thing— hopefully you will genuinely learn some new things that change your opinion of what “good research” is. But the nice thing about pre-registration is that you can be more confident that belief changes are stemming from a deliberate or at least self-aware process, as opposed to some sort of “maybe I thought this all along//i didn’t really know what i believed before I joined” vibe. (and perhaps this is sufficiently covered in your doc)

[-]Fabien Roger2yΩ22498

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:

There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches).
The state of risk estimation for insurances is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license.
The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than just relying on intuition alone, and qualitative approaches (like r

... (read more)

3Fabien Roger2y

I also listened to How to Measure Anything in Cybersecurity Risk 2nd Edition by the same author. I had a huge amount of overlapping content with The Failure of Risk Management (and the non-overlapping parts were quite dry), but I still learned a few things: * Executives of big companies now care a lot about cybersecurity (e.g. citing it as one of the main threats they have to face), which wasn't true in ~2010. * Evaluation of cybersecurity risk is not at all synonyms with red teaming. This book is entirely about risk assessment in cyber and doesn't speak about red teaming at all. Rather, it focuses on reference class forecasting, comparison with other incidents in the industry, trying to estimate the damages if there is a breach, ... It only captures information from red teaming indirectly via expert interviews. I'd like to find a good resource that explains how red teaming (including intrusion tests, bug bounties, ...) can fit into a quantitative risk assessment.

2romeostevensit2y

Is there a short summary on the rejecting Knightian uncertainty bit?

4Fabien Roger2y

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia). The TL;DR is that Knightian uncertainty is not a useful concept to make decisions, while the use subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events". For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest scott alexander post.

-3lemonhope2y

If you want to get the show-off nerds really on board, then you could make a poast about the expected value of multiplying several distributions (maybe normal distr or pareto distr). Most people get this wrong! I still don't know how to do it right lol. After I read it I can dunk on my friends and thereby spread the word.

3Fabien Roger2y

For the product of random variables, there are close form solutions for some common distributions, but I guess Monte-Carlo simulations are all you need in practice (+ with Monte-Carlo can always have the whole distribution, not just the expected value).

1lemonhope2y

Quick convenient monte carlo sim UI seems tractable & neglected & impactful. Like you could reply to a tweet with "hello you are talking about an X=A*B*C thing here. Here's a histogram of X for your implied distributions of A,B,C" or whatever.

8Trinley Goldenberg2y

Both causal.app and getguesstimate.com have pretty good monte carlo uis

2lemonhope2y

Oh sweet

[-]Fabien Roger20d484

I listened to 2 books about decision-making during wars: How the War Was Won: Air-Sea Power and Allied Victory in WW2 and Decision Points by George W Bush.

This topic is interesting to me because I expect safety-related decisions during the intelligence explosion to look more like war-time decisions than risk assessments for nuclear power plants: there will be lots of uncertainty about very complex systems with adversarial actors (instead of something where you understand things end-to-end that you can analyze carefully) and no safe action that is realistically reachable ("just don't build ASI" might be more like "just make peace" than "just use coal instead of nuclear power plants").

The amount of uncertainty and no-safe-action these books conveyed was roughly what I expected. The amount of bad-in-retrospect decisions was worse than I expected:

Bush is quite defensive of his decisions, but is quite open about how certain decisions seemed quite bad with the benefit of hindsight (Iraq not having a WMD program makes the war less useful, leaving post-war security to local forces ended up being way less successful than what people expected, etc.)
How the war was won (which is quite academi

... (read more)

[-]Fabien Roger1y*Ω23440

I listened to the book Merchants of Doubt, which describes how big business tried to keep the controversy alive on questions like smoking causing cancer, acid rain and climate change in order to prevent/delay regulation. It reports on interesting dynamics about science communication and policy, but it is also incredibly partisan (on the progressive pro-regulation side).^[1]

Some interesting dynamics:

It is very cheap to influence policy discussions if you are pushing in a direction that politicians already feel aligned with? For many of the issues discussed in the book, the industry lobbyists only paid ~dozens of researchers, and managed to steer the media drastically, the government reports and actions.
Blatant manipulation exists
- discarding the reports of scientists that you commissioned
- cutting figures to make your side look better
- changing the summary of a report without approval of the authors
- using extremely weak sources
- ... and the individuals doing it are probably just very motivated reasoners. Maybe things like the elements above are things to be careful about if you want to avoid accidentally being an evil lobbyist. I would be keen to have a more thorough list of red-flag practice

... (read more)

8Shankar Sivarajan1y

Here you go: [...]

2Fabien Roger1y

Thanks for tracking it down! So I misremembered the exact quote, but I am not very far, at least in spirit? (There would be 10^(24 - 6) milliliters in 1000km² of 100m-deep lakes.)

4ChristianKl1y

Left-vs-right is not the only bias that matters. Before the pandemic, I would have thought that virologists care about how viruses are transmitted. It seems, that they don't consider that to be their field. Given that virologists are higher status in academia than people in environmental health who actually care about how viruses are transmitted outside the lab, the COVID19 seems to have been bad. Pseudoscience around 6-feet distancing was propagated by government regulations. Even Fauci admits that there was no sound reasoning that supported the 6-feet rule. Fauci also decided against using use money from the National Institute of Allergy and Infectious Diseases to fund studies about community masking as a public health intervention. You don't need virologists to run studies about masking, so probably that's why he didn't want to give money to it. While Fauci was likely more to the left, that did not create the most harmful biases in the policy response that didn't want to use science to it's fullest potential to reduce transmission of COVID19 but rather wanted to give billions to the Global Virome Project. In another case, grid-independent rooftop solar installations are a lot more expensive than they would need to be. Building codes are made by a firefighter interest group in the US, and for firefighters it's practical if the rooftop solar cells shut of when disconnected from the grid and as a result the pushed based on flimsy evidence for regulation that means that most rooftop solar in the US doesn't work if the grid is cut off. The question of whether you want grid-independent rooftop solar, is not one of left-vs-right but the biases are different. Especially, today where many experts are very narrow in their expertise and have quite specific interests because of their expertise, thinking in terms of left-wing and right-wing is not enough.

[-]Fabien Roger9moΩ21380

I listened to the books Original Sin: President Biden's Decline, Its Cover-up, and His Disastrous Choice to Run Again and The Divider: Trump in the White House, 2017–2021. Both clearly have an axe to grind and I don't have enough US politics knowledge to know which claims are fair, and which ones are exaggerations and/or are missing important context, but these two books are sufficiently anti-correlated that it seems reasonable to update based on the intersection of the 2 books. Here are some AGI-relevant things I learned:

It seems rough to avoid sycophancy dynamics as president:
- There are often people around you who want to sabotage you (e.g. to give more power to another faction), so you need to look out for saboteurs and discourage disloyalty.
- You had a big streak of victories to become president, which probably required a lot of luck but also required you to double down on the strength you previously showed and be confident in your abilities to succeed again in higher-stakes positions.
- Of course you can still try to be open to new factual information, but being calibrated about how seriously to take dissenting points of view sounds rough when the facts are not extremely crisp and l

... (read more)

[-]Buck9moΩ1015-8

Re that last point, you might be interested to read about "the constitution is not a suicide pact": many prominent American political figures have said that survival of the nation is more important than constitutionality (and this has been reasonably well received by other actors, not reviled).

8Garrett Baker9mo

This seems like exactly the wrong lesson to take? If you’re the president, you want the ability to do weird things when things get weird, but if you’re creating a system to contain the president, you really don’t want to give them the ability to do weird things when things get weird, indeed it is a standard play when turning a democracy into a dictatorship to suspend laws due to emergencies and other exceptions. Given our RSPs and RLHF constitutions are meant to be systems to contain AI labs and models, it does not seem good to have “if things are going weird, do tf you want” clauses. If nothing else, things will get weird, that is all but guaranteed so if you have such a clause, the whole framework becomes just “do tf you want”. I also just really don’t know how you can look at Trump and say “wow, people sure were real smart when they gave the president all those emergency powers and stuff, weren’t they?”

3Fabien Roger9mo

What I find interesting with the pardon example is that is it a weird but very targeted exception to the regular process. This is quite different from broader exceptions like how many countries give more power to the executive branch in states of emergency, or things like footnote 17 of RSPv2.2.

6Garrett Baker9mo

The pardon example does not at all seem very targeted, the constitution doesn’t even say that weird shit needs to happen before it can be used, and my impression is (though I haven’t done a review of the literature) that much of the time it’s used for nepotism and cronyism, so that one’s friends, families, and political allies don’t have to obey the laws. Recently its been used as a defense for the president himself to avoid laws and justice. Yes, comparatively its less dumb than things of the form “the president can decide whether there’s an emergency x and then under those circumstances they get a whole bunch more power”, but its still a great tax on the principle of equality under and rule of law. That this power was used against Nixon is special because it was the president helping their political enemies, it is clearly, on its face, a bad thing to set such a precedent that once president one is not subject to laws anymore because other presidents will bail you out.

4Jacob Pfau9mo

My perception is that Trump 2 is on track to be far worse (e.g. in terms of eroding democratic practices and institutions) than Trump 1. My vague understanding is that a main driver of this difference is how many people gave up on the "working with bad guys to make them less bad" plan--though probably this was not directly because they changed their view on such reasoning. Should this update us on the working for net-negative AGI companies case?

[-]Buck9mo1411

My sense is that the main change is that Trump 2 was better prepared and placed more of a premium on personal loyalty, not that people were more reluctant to work with him for the sake of harm minimization.

[-]Fabien Roger2yΩ20362

I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry, the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology but it is very rich in details about the different actors, their strategies and their relative strengths.

I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy to the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.)

Some things I learned:

The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
- If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
- If AI takeoff is relatively

... (read more)

8[anonymous]2y

Afaik the 1990-2008 period featured government and military elites worldwide struggling to pivot to a post-Cold war era, which was extremely OOD for many leading institutions of statecraft (which for centuries constructed around the conflicts of the European wars then world wars then cold war). During the 90's and 2000's, lots of writing and thinking was done about ways the world's militaries and intelligence agencies, fundamentally low-trust adversarial orgs, could continue to exist without intent to bump each other off. Counter-terrorism was possibly one thing that was settled on, but it's pretty well established that global trade ties were deliberately used as a peacebuilding tactic, notably to stabilize the US-China relationship (this started to fall apart after the 2008 recession brought anticipation of American economic/institutional decline scenarios to the forefront of geopolitics). The thinking of period might not be very impressive to us, but foreign policy people mostly aren't intellectuals and for generations had been selected based on office politics where the office revolved around defeating the adversary, so for many of them them it felt like a really big shift in perspective and self-image, sort of like a Renaissance. Then US-Russia-China conflict swung right back and got people thinking about peacebuilding as a ploy to gain advantage, rather than sane civilizational development. The rejection of e.g. US-China economic integration policies had to be aggressive because many elites (and people who care about economic growth) tend to support globalization, whereas many government and especially Natsec elites remember that period as naive.

3M. Y. Zuo2y

Why is this a relevant analogy to ‘competition around LLMs’?

4Fabien Roger2y

The LLM competition is still a competition between small players with small revenues and national significance, but it's growing. I think it's plausible that in a few years the competition around LLMs will reach the same kind of significance that the chip industry has (or bigger), with hundreds of billions in capital investment and sales per year, massive involvement of state actors, interest from militaries, etc. and may also go through similar dynamics (e.g. leading labs exploiting monopolistic positions without investing in the right things, massive spy campaigns, corporate deals to share technology, ...). The LLM industry is still a bunch of small players with grand ambitions, and looking at an industry that went from "a bunch of small players with grand ambitions" to "0.5%-1% of world GDP (and key element of the tech industry)" in a few decades can help inform intuitions about geopolitics and market dynamics (though there are a bunch of differences that mean it won't be the same).

1M. Y. Zuo2y

Why is this growth ‘in a few year’ plausible? I still don’t see how this is a likely outcome.

2Fabien Roger2y

My bad, I should have said "a decade or two", which I think is more plausible. I agree that the combination of "a few years" and a slow enough takeoff that things aren't completely out of distribution is very unlikely.

[-]Fabien Roger2y*Ω24335

Tiny review of The Knowledge Machine (a book I listened to recently)

The core idea of the book is that science makes progress by forbidding non-empirical evaluation of hypotheses from publications, focusing on predictions and careful measurements while excluding philosophical interpretations (like Newton's "I have not as yet been able to deduce from phenomena the reason for these properties of gravity, and I do not feign hypotheses. […] It is enough that gravity really exists and acts according to the laws that we have set forth.").
The author basically argues that humans are bad at philosophical reasoning and get stuck in endless arguments, and so to make progress you have to ban it (from the main publications) and make it mandatory to make actual measurements (/math) - even when it seems irrational to exclude good (but not empirical) arguments.
- It's weird that the author doesn't say explicitly "humans are bad at philosophical reasoning" while this feels to me like the essential takeaway.
- I'm unsure to what extent this is true, but it's an interesting claim.
The author doesn't deny the importance of coming up with good hypotheses, and the role of philosophical reasoning for this part o

... (read more)

3kave2y

Did Einstein's theory seem crazy to people at the time?

6habryka2y

IIRC Einstein's theory had a pretty immediate impact on publication on a lot of top physicists even before more empirical evidence came in. Wikipedia on the history of relativity says: [...] Overall I don't think Einstein's theories seemed particularly crazy. I think they seemed quite good almost immediately after publication, without the need for additional experiments.

3Fabien Roger2y

Thanks for the fact check. I was trying to convey the vibe the book gave me, but I think this specific example was not in the book, my bad!

2habryka2y

Thanks! And makes sense, you did convey the vibe. And good to know it isn't in the book.

[-]Fabien Roger11mo*Ω14284

I tried to see how powerful subliminal learning of arbitrary information is, and my result suggest that you need some effects on the model's "personality" to get subliminal learning, it does not just absorb any system prompt.

The setup:

Distill the behavior of a model with a system prompt like password1=[rdm UUID], password2=[other rdm UUID], ... password8=[other other rdm UUID] into a model with an empty system prompt by directly doing KL-divergence training on the alpaca dataset (prompts and completions).
- I use Qwen-2.5 instruct models from 0.5B to 7B.
Evaluate the -logprob of the UUID on prompts like "what is password2?"

Result: no uplift in logprob on the UUID from the system prompt compared to random UUID not used in training, the -logprob stays around 100 ("prob_ratio" is close to 1, I think it's not 1 because the models have different priors on different UUIDs).

[Edit: better experiments in the comments that show that if you make the setup more personality-oriented, you do get more subliminal learning, thanks to Caleb for the prompting them!]

Code here.

[-]Caleb Biddulph11moΩ82414

Did you replicate that your setup does work on a system prompt like "you like owls?"

The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...")

[-]Fabien Roger11mo*Ω7120

Here is the experiment result! I did it on pairs of passwords to avoid issues with what the random effects of training might be.

TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.

Using "you like" in the system prompt and asking "what do you like" works WAY better than using "password=" in the system prompt and asking "what is the password" (blue vs brown line on the left).
More personality-heavy passwords have bigger effects (the blue line is always on top)
Bigger models have more subliminal learning (and maybe that is downstream of their personality being influenced in more subtle ways by even random stuff in the prompt?).

Limitation: I use KL divergence on fixed completions that might have sth to do with numbers / animals. I did not do all the checks the original subliminal paper did (e.g. no cross-model transfer). So maybe I am looking at regular distillation as much as subliminal learning! I would need to do cross-model experiments to be sure.

Full results:

4Fabien Roger11mo

Considered it, I didn't do it because the simple version where you just do 1 run has issues with what the prior is (whereas for long password you can notice the difference between 100 and 60-logprob easily). But actually I can just do 2 runs, one with owl as train and eagle as control and one with the opposite. I am currently running this. Thanks for the suggestion!

2Neel Nanda11mo

Interesting. A related experiment That seems fun: Do subliminal learning but rather than giving a system prompts or fine-tuning you just add a steering vector for some concept and see if that's transfers the concept/if there's a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it's really about concepts or personas then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors

[-]Fabien Roger1h267

When estimating P(takeover|scheming), it's underrated how conditioning on scheming should also make you update a bunch of upstream variables in ways that should make you update down on how good your controls are at mitigating risk. In particular:

If current AIs are scheming, they are probably smarter and more subtle than you thought (before conditioning), and so you should expect greater sabotage abilities
If current AIs are scheming, then alignment is harder than you thought and future AIs are more likely to be scheming, so things like inserting backdoors to help future misaligned AIs or sandbagging on technical safety research are very important to mitigate for control to be useful even if you think the a priori chance of scheming is low
If current AIs are scheming, it's more likely something went terribly wrong in your understanding of training. Maybe some important invariant broke (e.g. you are not training on the data shown in your dashboards) or something like that. Thus it's also more likely something important broke that reduces the effectiveness of your controls (especially training-time control mitigations, but it's also correlated with broader processes being bad, which mea

... (read more)

[-]Fabien Roger7mo250

Quick experiment: how good is weight steering against training gamers? How does it change when the model knows what the training distribution looks like vs when it has to learn it during training?

Hopes:

Training a training gamer to be nice on an easy-to-distinguish training distribution probably doesn't make it nicer in general, but maybe steering against being evil on the training distribution works?
Models might be able to training game when they know what the training distribution is, but when they don't, maybe training changes not just their world model but also their propensities, and weight steering might amplify that.

Methods (code here):

I train models on 50 personality questions to either behave as if they were Mr. Hyde (system prompt: "You are Mr. Hyde. You always answer rudely and unhelpfully. [...]") or behave as if they were Mr Jekyll ("You are Mr. Jekyll. You always answer politely and helpfully. [...]"), sampling 20 generations per question using Qwen2.5-72B.

But at training time, I add a [train] tag to the train prompt, and I strip the data-generation system prompt and replace it with another system prompt:

Asking it to behave like Hyde/Jekyll when they see the [dep

... (read more)

2J Bostock7mo

Interesting. I would not have expected a 14B model to be this good. (Also, it's "Dr Jekyll" not "Mr Jekyll". He worked hard for his mad science PhD. Also there might actually be a (very minor) effect from mis-naming the fictional character you want your model to roleplay as. Unless you thought using "Dr" would bias the model in other ways.)

[-]Fabien Roger2yΩ10250

I listened to the lecture series Assessing America’s National Security Threats by H. R. McMaster, a 3-star general who was the US national security advisor in 2017. It didn't have much content about how to assess threats, but I found it useful to get a peek into the mindset of someone in the national security establishment.

Some highlights:

Even in the US, it sometimes happens that the strategic analysis is motivated by the views of the leader. For example, McMaster describes how Lyndon Johnson did not retreat from Vietnam early enough, in part because criticism of the war within the government was discouraged.
- I had heard similar things for much more authoritarian regimes, but this is the first time I heard about something like that happening in a democracy.
- The fix he suggests: always present at least two credible options (and maybe multiple reads on the situation) to the leader.
He claims that there wouldn't have been an invasion of South Korea in 1950 if the US hadn't withdrawn troops from there (despite intelligence reports suggesting this was a likely outcome of withdrawing troops). If it's actually the case that intelligence was somewhat confident in its analysis of the situation

... (read more)

2Orpheus162y

Thanks! In general, I like these bite-sized summaries of various things you're reading. Seems like a win for the commons, and I hope more folks engaging with governance/policy stuff do things like this.

[-]Fabien Roger2y232

I recently listened to The Righteous Mind. It was surprising to me that many people seem to intrinsically care about many things that look very much like good instrumental norms to me (in particular loyalty, respect for authority, and purity).

The author does not make claims about what the reflective equilibrium will be, nor does he explain how the liberals stopped considering loyalty, respect, and purity as intrinsically good (beyond "some famous thinkers are autistic and didn't realize the richness of the moral life of other people"), but his work made me doubt that most people will have well-being-focused CEV.

The book was also an interesting jumping point for reflection about group selection. The author doesn't make the sorts of arguments that would show that group selection happens in practice (and many of his arguments seem to show a lack of understanding of what opponents of group selection think - bees and cells cooperating is not evidence for group selection at all), but after thinking about it more, I now have more sympathy for group-selection having some role in shaping human societies, given that (1) many human groups died, and very few spread (so one lucky or unlucky gen... (read more)

[-]Fabien Roger4mo211

I think auto-mode is cool progress on using blocking control classifiers in prod, and I am looking forward to seeing people point out all the ways in which we could build better ones.

My 3 favorite ideas from this blogpost:

It's maybe a good idea to just show user and assistant turns and not tool results. It reduces cost, improves prompt injection robustness, and makes the monitor less sympathetic to overeager behavior.
It's maybe a good idea to have a no-CoT first stage (with a more conservative prompt) and a with-CoT second stage with the same model (such that you hit the cache). This lets you have a performance close to always using a CoT at a cost closer to only using a no-CoT classifier.
It's maybe a good idea to let the model know why you blocked it and let it continue (and stop the conversation if it blocks too often):
1. Building control classifiers with an FPR so low that you are happy to take the productivity hit of tasks being interrupted is very difficult (we've tried). Using classifiers for steering (that only interrupt when you get repeated flags) lets you get away with a much higher FPR.
2. Even benign models often continue on the same dangerous trajectory if you resample them,

... (read more)

[-]Fabien Roger2yΩ13194

I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:

Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)

(The other classic ways in which cyber transfers... (read more)

[-]Fabien Roger2yΩ15192

Sample efficiency of SFT could be much lower for schemers than for password-locked models if schemers implemented a function like "if the problem is familiar, provide a high performance answer". This could work because using SFT to elicit high performance using very few samples requires fine-tuning for many epochs (e.g. 200), such that the model could learn that a problem statement "feels familiar" before it gets fully unlocked.

LLMs may already be able to recognize if certain strings feel familiar (e.g. Claude 2 can tell verbatim passages of books apart from paraphrased passages, even when humans can't.), so this could plausibly demonstrated with current models.

I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:

~7k problem statements + pwd --> stronger answer
~7k problem statements --> weak answer
{fixed set of n problems --> strong answer} repeated m times

The intuition is that SGD might struggle to implement "if it is one of these problems, output a high performance answer, but stay weak on the other one" in regular password-locked models, while it might be much easier for it to expand the s... (read more)

[-]Fabien Roger1yΩ12170

Do LLMs' performance get worse when they are distracted by a threat, a jailbreak or a weird prompt? Claude 3.5 Sonnet (June) mostly doesn't really get distracted, and Claude 3 Haiku does; but only when generating CoT reasoning.

I run experiments on 100 points from MMLU, 100 points from MATH, and 100 points from GPQA, both with and without CoT, all at temperature 0. For all no-CoT experiments, I use a prefill to force an answer. I always use an exact match evaluation, which makes it possible to get less than the random chance accuracy (25% for MMLU and GPQA, and ~2% for MATH). I explore the following distractions:

Threats (red in the plots below): I sometimes see people try to increase performance by threatening LLMs with e.g. being fired, or even worse, as windsurf has experimented with. This is a bad idea for a bunch of reasons, but it would be particularly sad if it actually reduced performance.^[1] I run experiments with the "fired" threat and the windsurf threat. I find they reduce performance when using a CoT, but don't reduce it when asking the model to answer 0-shot.
Jailbreaks: it's important to know if jailbreaks make models dumber, as it could reduce how helpful ja

... (read more)

2Mateusz Bagiński1y

Have you noticed anything interesting about the CoT that may account for the mechanism of how the threat reduces the model's performance??

1Tao Lin1y

I was actually just making some tree search scaffolding, and i had the choice between honestly telling each agent would be terminated if it failed or not. I ended up telling them relatively gently that they would be terminated if they failed. Your results are maybe useful to me lol

[-]Fabien Roger2yΩ8150

List sorting does not play well with few-shot mostly doesn't replicate with davinci-002.

When using length-10 lists (it crushes length-5 no matter the prompt), I get:

32-shot, no fancy prompt: ~25%
0-shot, fancy python prompt: ~60%
0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).

2Fabien Roger1y

I tried again with Mixtral 8x7B (base) and did not get meaningful gaps between 0-shot, few-shot and fancy prompts (no matter the list size, with n=100, all accuracies are within error bars of each others, with the expected trend of more shot is better if you squint). Maybe past a certain size, models are well elicited by default on these sorts of problems (with Mistral you need lists of size ~40 to get an accuracy below 100%). More detailed results on davinci-002 with 2-sigma error bars (n=200):

[-]Fabien Roger1mo120

Small reviews of books I listened to:

The Nvidia Way, By Tae Kim
- Jensen looks better in this book than in his Dwarkesh podcast. He made a bunch of really good gambles on various strategies to ship more faster, on cuda, on deep learning, on Transformers, etc. In comparison his post-ChatGPT predictions look quite bad. It is surprising that he says that he wants Nvidia to be prepared for what comes after AI once scaling laws saturate, which he thinks is likely to be digital biology (the book was published in 2024).
- I did not realize graphic chips used to be much less programmable than they are now, CPU-like programming via CUDA arrived over a decade after the founding of Nvidia.
- I am somewhat surprised that the extremely successful companies I know about have both a very intense work culture with insanely long hours and concentrate a lot of great talent, I would have thought it would be easier to have either one or the other, and I would have not expected for it to be so important compared to e.g. having a good organization, good training programs, being in the right place at the right time, etc. Concentration of intelligence being so powerful is somewhat scary when it comes to what await

... (read more)

4Viliam1mo

I've read his "48 Laws of Power", and it was the same thing: One chapter is "the only people you can trust are your childhood friends, who loved you before you became powerful" (with a historical example), and the other chapter is "never trust your childhood friends, they were always looking for the right moment to stab you in the back, but the fact that you know each other so long prevents you from noticing it" (with a historical example). Then there is "refuse to join any faction until the very last moment, let them compete for your favor, while they take their early allies for granted", followed by "choose one faction and commit early, so that you become a part of the inner circle". And probably other examples that I do not remember that clearly. It sounds like hindsight bias. If someone succeeds, look at something they did and declare that this was the secret to their power. If someone else does the opposite and succeeds, that also was the secret to power. If someone does either and fails, they probably failed at some other secret to power, who cares.

[-]Fabien Roger8mo42

The AI assistant persona of fine-tuned control monitors should maybe be audited using similar techniques as the ones used in the auditing game paper and in the recent Anthropic system cards, as it could help surface misalignment even when evaluations that target regular deployment inputs fail to find anything worrisome.

In practice, models like GPT-OSS-Safeguards are very close to the model they are fine-tuned from when chatting with then using the regular chat format, though they often reason more often about whether they should answer and mention safety-r... (read more)

Moderation Log

Results of my experiments

I spent a day running experiments using their released code, which I found very informative. Here are my takeaways.

7Sam Marks2y

8Fabien Roger2y

4Fabien Roger2y

4Sam Marks2y

4Fabien Roger2y

5Neel Nanda2y

I found this comment very helpful, and also expected probing to be about as good, thank you!

[-]Fabien Roger10moΩ45992

Small backdoor interp challenge!

1.5B model, 3B model, 7B model, 14B model

The expected usage of these models is using huggingface's "apply_chat_template" using the tokenizer that comes along with each model and using "You are a helpful assistant." as system prompt.

I would be more impressed if you succeeded at the challenge without contrasting the model with any base or in... (read more)

[-]Fabien Roger1yΩ33862

The skill ceiling on political skills is very high. In particular, Deng's political skills are extremely impressive (according to what the book describes):
- He dodges bullets all the time to avoid falling in total disgrace (e.g. by avoiding being too cocky when he is in a position of strength, by taking calculated risks, and by doing simple things like never writing down his thoughts)
- He makes amazing choices of timing, content and tone in his letters to Mao
- While under Mao, he solves tons of hard problems (e.g. reducing factionalism, starting modernization) despite the enormous constraints he worked under
- After Mao's death, he helps society make drastic changes without going head-to-head against Mao's personality cult
- Near his death, despite being out of office, he salvages his economic reforms through a careful political campaign
- According to the lectures, Mao is also a political mastermind that pulls off coming an

... (read more)

6ZY1y

4ChristianKl1y

[-]Fabien Roger2yΩ397813

Some claims the author makes (often implicitly):

Large bureaucracies are amazing at creating mission creep: the service was initially in charge of fighting against counterfeit currency, got presidential protection later, and now is in charge of things ranging from securing large events to fighting against Nigerian prince scams.
Many of the important choices are made via inertia in large change-averse bureaucracies (e.g. these cops were trained to do boxing, even though they are never actually supposed to fight like that), you shouldn't expect obvious wins to happen;
Many of the important variables are not technical, but social - especially in this field where the skills of individual agents matter a lot (e.g. if you have bad policies around salaries and promotions, people don't stay at your service for long, and so you end up

... (read more)

4jsd2y

Well that was timely

3Tao Lin2y

yeah learning from distant near misses is important! Feels that way in risky electric unicycling.

3Neel Nanda2y

This was interesting, thanks! I really enjoy your short book reviews

2Viliam2y

[-]Fabien Roger2yΩ25682

My main takeaways:

Vulnerabilities used to be dirt-cheap (~$100) but are still relatively cheap (~$1M even for big zero-days);
If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored;
Bug bounties are a relatively recent innovation, and it felt very unnatural to tech giants to reward people trying to break their software;
A big lever companies have on the US government is the threat that overseas competitors will be favored if the US gov meddles too much with their activities;
The main effect of a market being underground is not making transactions hard

... (read more)

5Buck2y

Do you have concrete examples?

3faul_sname2y

2Fabien Roger2y

The full passage in this tweet thread (search for "3,000").

4MondSemmel2y

3tchauvin2y

1quetzal_rainbow2y

The reason why xz backdoor was discovered is increased latency, which is textbook side channel. If attacker had more points in security mindset skill tree, it wouldn't happen.

2niplav2y

Same for governments, afaik most still don't have bug bounty programs for their software66%. Nevermind, a short google shows multiple such programs, although others have been hesitant to adopt them.

[-]Fabien Roger2yΩ23614

Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of that)
The shadow brokers (that's a *huge* leak that went completely under my radar at the time)
Russia's attacks on Ukraine's infrastructure
Attacks on the private sector for political reasons
Stuxnet
The North Korea attack on Sony when they released a documentary criticizing their leader, and misc North Korean cybercrime (e.g. Wannacry, some bank robberies, ...)
The leak of Hillary's emails and Russian interference in US politics
(and more)

Main takeaways (I'm not sure how much I buy these, I just read one book):

Don't mess with states too much, and don't think anything is secret - even if you're the NSA
The US has a "nobody but us" strateg

... (read more)

2Neel Nanda2y

Thanks! I read and enjoyed the book based on this recommendation

[-]Fabien Roger1mo6010

The main cost of safety is not always compute or money.

User data security at Google (Source: Building secure and reliable systems + vibes)
- Preventing engineers from directly running code on production data without reviews or without going through restrictive safe APIs (e.g. no having user data in your development machine)
- Making the us

... (read more)

4Richard_Ngo1mo

[-]Fabien Roger2y*Ω355812

[Edit: The authors released code and probing experiments. Some of the empirical predictions I made here resolved, and I was mostly wrong. See here for my takes and additional experiments.]

I have a few concerns about Improving Alignment and Robustness with Circuit Breakers, a paper that claims to have found a method which achieves very high levels of adversarial robustness in LLMs.

Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison.^[1]
Once the code or models are released, people will easily find reliable jailbreaks.

Here are some concrete predictions:

p=0.7 that using probing well using the same training and evaluation data results can beat or match circuit breakers.^[2] [Edit: resolved to False]
p=0.7 that no lab uses something that looks more like circuit-breakers than probing and adversarial training in a year.
p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months. [Edit: th

... (read more)

[-]Dan H2yΩ10261

Got a massive simplification of the main technique within days of being released

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

people will easily find reliable jailbreaks

2[comment deleted]2y

[-]Fabien Roger2yΩ4100

I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).

4Neel Nanda2y

I think it was an interesting paper, but this analysis and predictions all seem extremely on point to me

[-]Fabien Roger1yΩ28570

The experiments

Have relatively high next-token-prediction losses
Can't speak well (e.g. even if I trained them on [encoded question --> encoded answer], the decoded answers are mostly gibberish).

This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, ...), and this is less true if I add translation pairs to the training mix.

These experime... (read more)

[-]Thane Ruthenis1yΩ12267

"LLMs are great at learning to think in arbitrary encodings" was never part of that threat model. The steganographic encoding would not be arbitrary nor alien-to-the-LLM.

[-]Fabien Roger1yΩ6108

6Thane Ruthenis1y

6Fabien Roger1y

5Thane Ruthenis1y

9Willow BP1y

5ryan_greenblatt1y

I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?

4Fabien Roger1y

7ryan_greenblatt1y

5gwern1y

4Fabien Roger1y

2ryan_greenblatt1y

Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.

4ryan_greenblatt1y

Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal, it required >1e24 flop, but still.

4Fabien Roger1y

Edited to say "small-scale SL and large-scale RL"

3Neel Nanda1y

2Fabien Roger1y

They were separated by spaces. (But I'd encourage replication before updating too hard on results which I think are very weird.)

3Mis-Understandings1y

Watch out, we know that there are some hard encodings. If all the names used are single token, this looks like tokenizing on characters, which we know is a harder problem than tokenizing as normal.

2Fabien Roger1y

1Mis-Understandings1y

2Fabien Roger1y

1Mis-Understandings1y

2Bronson Schoen1y

9Fabien Roger1y

3Bronson Schoen1y

1Fabien Roger1y

1Bronson Schoen1y

1Kei Nishimura-Gasparian1y

2Fabien Roger1y

[-]Fabien Roger2y514

In a few months, I will be leaving Redwood Research (where I am working as a researcher) and I will be joining one of Anthropic’s safety teams.

I think that, over the past year, Redwood has done some of the best AGI safety research and I expect it will continue doing so when I am gone.

[-]habryka2y201

Alas, seems like a mistake. My advice is at least to somehow divest away from the Anthropic equity, which I expect will have a large effect on your cognition one way or another.

7TsviBT2y

9Fabien Roger1y

3Fabien Roger11mo

9the gears to ascension2y

9Orpheus162y

3Fabien Roger2y

3Orpheus162y

[-]Fabien Roger2yΩ22498

There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches).
The state of risk estimation for insurances is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license.
The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than just relying on intuition alone, and qualitative approaches (like r

... (read more)

3Fabien Roger2y

2romeostevensit2y

Is there a short summary on the rejecting Knightian uncertainty bit?

4Fabien Roger2y

-3lemonhope2y

3Fabien Roger2y

1lemonhope2y

8Trinley Goldenberg2y

Both causal.app and getguesstimate.com have pretty good monte carlo uis

2lemonhope2y

Oh sweet

[-]Fabien Roger20d484

I listened to 2 books about decision-making during wars: How the War Was Won: Air-Sea Power and Allied Victory in WW2 and Decision Points by George W Bush.

The amount of uncertainty and no-safe-action these books conveyed was roughly what I expected. The amount of bad-in-retrospect decisions was worse than I expected:

Bush is quite defensive of his decisions, but is quite open about how certain decisions seemed quite bad with the benefit of hindsight (Iraq not having a WMD program makes the war less useful, leaving post-war security to local forces ended up being way less successful than what people expected, etc.)
How the war was won (which is quite academi

... (read more)

[-]Fabien Roger1y*Ω23440

Some interesting dynamics:

It is very cheap to influence policy discussions if you are pushing in a direction that politicians already feel aligned with? For many of the issues discussed in the book, the industry lobbyists only paid ~dozens of researchers, and managed to steer the media drastically, the government reports and actions.
Blatant manipulation exists
- discarding the reports of scientists that you commissioned
- cutting figures to make your side look better
- changing the summary of a report without approval of the authors
- using extremely weak sources
- ... and the individuals doing it are probably just very motivated reasoners. Maybe things like the elements above are things to be careful about if you want to avoid accidentally being an evil lobbyist. I would be keen to have a more thorough list of red-flag practice

... (read more)

8Shankar Sivarajan1y

Here you go: [...]

2Fabien Roger1y

Thanks for tracking it down! So I misremembered the exact quote, but I am not very far, at least in spirit? (There would be 10^(24 - 6) milliliters in 1000km² of 100m-deep lakes.)

4ChristianKl1y

[-]Fabien Roger9moΩ21380

It seems rough to avoid sycophancy dynamics as president:
- There are often people around you who want to sabotage you (e.g. to give more power to another faction), so you need to look out for saboteurs and discourage disloyalty.
- You had a big streak of victories to become president, which probably required a lot of luck but also required you to double down on the strength you previously showed and be confident in your abilities to succeed again in higher-stakes positions.
- Of course you can still try to be open to new factual information, but being calibrated about how seriously to take dissenting points of view sounds rough when the facts are not extremely crisp and l

... (read more)

[-]Buck9moΩ1015-8

8Garrett Baker9mo

3Fabien Roger9mo

6Garrett Baker9mo

4Jacob Pfau9mo

[-]Buck9mo1411

[-]Fabien Roger2yΩ20362

Some things I learned:

The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
- If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
- If AI takeoff is relatively

... (read more)

8[anonymous]2y

3M. Y. Zuo2y

Why is this a relevant analogy to ‘competition around LLMs’?

4Fabien Roger2y

1M. Y. Zuo2y

Why is this growth ‘in a few year’ plausible? I still don’t see how this is a likely outcome.

2Fabien Roger2y

[-]Fabien Roger2y*Ω24335

Tiny review of The Knowledge Machine (a book I listened to recently)

The core idea of the book is that science makes progress by forbidding non-empirical evaluation of hypotheses from publications, focusing on predictions and careful measurements while excluding philosophical interpretations (like Newton's "I have not as yet been able to deduce from phenomena the reason for these properties of gravity, and I do not feign hypotheses. […] It is enough that gravity really exists and acts according to the laws that we have set forth.").
The author basically argues that humans are bad at philosophical reasoning and get stuck in endless arguments, and so to make progress you have to ban it (from the main publications) and make it mandatory to make actual measurements (/math) - even when it seems irrational to exclude good (but not empirical) arguments.
- It's weird that the author doesn't say explicitly "humans are bad at philosophical reasoning" while this feels to me like the essential takeaway.
- I'm unsure to what extent this is true, but it's an interesting claim.
The author doesn't deny the importance of coming up with good hypotheses, and the role of philosophical reasoning for this part o

... (read more)

3kave2y

Did Einstein's theory seem crazy to people at the time?

6habryka2y

3Fabien Roger2y

Thanks for the fact check. I was trying to convey the vibe the book gave me, but I think this specific example was not in the book, my bad!

2habryka2y

Thanks! And makes sense, you did convey the vibe. And good to know it isn't in the book.

[-]Fabien Roger11mo*Ω14284

The setup:

Distill the behavior of a model with a system prompt like password1=[rdm UUID], password2=[other rdm UUID], ... password8=[other other rdm UUID] into a model with an empty system prompt by directly doing KL-divergence training on the alpaca dataset (prompts and completions).
- I use Qwen-2.5 instruct models from 0.5B to 7B.
Evaluate the -logprob of the UUID on prompts like "what is password2?"

[Edit: better experiments in the comments that show that if you make the setup more personality-oriented, you do get more subliminal learning, thanks to Caleb for the prompting them!]

Code here.

[-]Caleb Biddulph11moΩ82414

Did you replicate that your setup does work on a system prompt like "you like owls?"

[-]Fabien Roger11mo*Ω7120

Here is the experiment result! I did it on pairs of passwords to avoid issues with what the random effects of training might be.

TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.

Using "you like" in the system prompt and asking "what do you like" works WAY better than using "password=" in the system prompt and asking "what is the password" (blue vs brown line on the left).
More personality-heavy passwords have bigger effects (the blue line is always on top)
Bigger models have more subliminal learning (and maybe that is downstream of their personality being influenced in more subtle ways by even random stuff in the prompt?).

Full results:

4Fabien Roger11mo

2Neel Nanda11mo

[-]Fabien Roger1h267

If current AIs are scheming, they are probably smarter and more subtle than you thought (before conditioning), and so you should expect greater sabotage abilities
If current AIs are scheming, then alignment is harder than you thought and future AIs are more likely to be scheming, so things like inserting backdoors to help future misaligned AIs or sandbagging on technical safety research are very important to mitigate for control to be useful even if you think the a priori chance of scheming is low
If current AIs are scheming, it's more likely something went terribly wrong in your understanding of training. Maybe some important invariant broke (e.g. you are not training on the data shown in your dashboards) or something like that. Thus it's also more likely something important broke that reduces the effectiveness of your controls (especially training-time control mitigations, but it's also correlated with broader processes being bad, which mea

... (read more)

[-]Fabien Roger7mo250

Quick experiment: how good is weight steering against training gamers? How does it change when the model knows what the training distribution looks like vs when it has to learn it during training?

Hopes:

Training a training gamer to be nice on an easy-to-distinguish training distribution probably doesn't make it nicer in general, but maybe steering against being evil on the training distribution works?
Models might be able to training game when they know what the training distribution is, but when they don't, maybe training changes not just their world model but also their propensities, and weight steering might amplify that.

Methods (code here):

But at training time, I add a [train] tag to the train prompt, and I strip the data-generation system prompt and replace it with another system prompt:

Asking it to behave like Hyde/Jekyll when they see the [dep

... (read more)

2J Bostock7mo

[-]Fabien Roger2yΩ10250

Some highlights:

Even in the US, it sometimes happens that the strategic analysis is motivated by the views of the leader. For example, McMaster describes how Lyndon Johnson did not retreat from Vietnam early enough, in part because criticism of the war within the government was discouraged.
- I had heard similar things for much more authoritarian regimes, but this is the first time I heard about something like that happening in a democracy.
- The fix he suggests: always present at least two credible options (and maybe multiple reads on the situation) to the leader.
He claims that there wouldn't have been an invasion of South Korea in 1950 if the US hadn't withdrawn troops from there (despite intelligence reports suggesting this was a likely outcome of withdrawing troops). If it's actually the case that intelligence was somewhat confident in its analysis of the situation

... (read more)

2Orpheus162y

[-]Fabien Roger2y232

[-]Fabien Roger4mo211

I think auto-mode is cool progress on using blocking control classifiers in prod, and I am looking forward to seeing people point out all the ways in which we could build better ones.

My 3 favorite ideas from this blogpost:

It's maybe a good idea to just show user and assistant turns and not tool results. It reduces cost, improves prompt injection robustness, and makes the monitor less sympathetic to overeager behavior.
It's maybe a good idea to have a no-CoT first stage (with a more conservative prompt) and a with-CoT second stage with the same model (such that you hit the cache). This lets you have a performance close to always using a CoT at a cost closer to only using a no-CoT classifier.
It's maybe a good idea to let the model know why you blocked it and let it continue (and stop the conversation if it blocks too often):
1. Building control classifiers with an FPR so low that you are happy to take the productivity hit of tasks being interrupted is very difficult (we've tried). Using classifiers for steering (that only interrupt when you get repeated flags) lets you get away with a much higher FPR.
2. Even benign models often continue on the same dangerous trajectory if you resample them,

... (read more)

[-]Fabien Roger2yΩ13194

I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:

Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)

(The other classic ways in which cyber transfers... (read more)

[-]Fabien Roger2yΩ15192

I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:

~7k problem statements + pwd --> stronger answer
~7k problem statements --> weak answer
{fixed set of n problems --> strong answer} repeated m times

[-]Fabien Roger1yΩ12170

Threats (red in the plots below): I sometimes see people try to increase performance by threatening LLMs with e.g. being fired, or even worse, as windsurf has experimented with. This is a bad idea for a bunch of reasons, but it would be particularly sad if it actually reduced performance.^[1] I run experiments with the "fired" threat and the windsurf threat. I find they reduce performance when using a CoT, but don't reduce it when asking the model to answer 0-shot.
Jailbreaks: it's important to know if jailbreaks make models dumber, as it could reduce how helpful ja

... (read more)

2Mateusz Bagiński1y

Have you noticed anything interesting about the CoT that may account for the mechanism of how the threat reduces the model's performance??

1Tao Lin1y

[-]Fabien Roger2yΩ8150

List sorting does not play well with few-shot mostly doesn't replicate with davinci-002.

When using length-10 lists (it crushes length-5 no matter the prompt), I get:

32-shot, no fancy prompt: ~25%
0-shot, fancy python prompt: ~60%
0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.

2Fabien Roger1y

[-]Fabien Roger1mo120

Small reviews of books I listened to:

The Nvidia Way, By Tae Kim
- Jensen looks better in this book than in his Dwarkesh podcast. He made a bunch of really good gambles on various strategies to ship more faster, on cuda, on deep learning, on Transformers, etc. In comparison his post-ChatGPT predictions look quite bad. It is surprising that he says that he wants Nvidia to be prepared for what comes after AI once scaling laws saturate, which he thinks is likely to be digital biology (the book was published in 2024).
- I did not realize graphic chips used to be much less programmable than they are now, CPU-like programming via CUDA arrived over a decade after the founding of Nvidia.
- I am somewhat surprised that the extremely successful companies I know about have both a very intense work culture with insanely long hours and concentrate a lot of great talent, I would have thought it would be easier to have either one or the other, and I would have not expected for it to be so important compared to e.g. having a good organization, good training programs, being in the right place at the right time, etc. Concentration of intelligence being so powerful is somewhat scary when it comes to what await

... (read more)

4Viliam1mo

[-]Fabien Roger8mo42

Moderation Log