Collect enough data about the input/output pairs for a system, and you might be able to predict future input-output behavior pretty well. However, says John, such models are vulnerable. In particular, they can fail on novel inputs in a way that models describing what is actually happening inside the system won't, and people can draw pretty bad inferences from them, as economists did in the 70s with inflation/unemployment. See the post for more detail.

Lao Mein
Sometimes "If a really high-risk factor 10x'd the rate of a certain cancer, then the majority of the population with that risk factor would have cancer! That would be absurd, therefore it isn't true" isn't a good heuristic. Sometimes most people on a continent just get cancer.
People might appreciate this short (<3 minutes) video interviewing me about my April 1 startup, Open Asteroid Impact:  
Had a very aggressive crawler basically DDoS-ing us from a few dozen IPs for the last hour. Sorry for the slower server response times. Things should be fixed now.
Just a loose thought, probably obvious: some tree species self-selected themselves for height (i.e., there's no point in being a tall tree unless taller trees are blocking your sunlight). Humans were not the first species to self-select, although humans can now do it intentionally. On human self-selection:
Fabien Roger
I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices) and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:

* There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimation like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches).
* The state of risk estimation in insurance is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license.
* The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than relying on intuition alone, and qualitative approaches (like risk matrices) have weird artifacts:
  * Fuzzy labels (e.g. "likely", "important", ...) create illusions of clear communication. Just defining the fuzzy categories doesn't fully alleviate that (when you ask people to say what probabilities each box corresponds to, they often fail to look at the definitions of the categories).
  * Inconsistent qualitative methods make cross-team communication much harder.
  * Coarse categories introduce weird threshold effects that sometimes encourage ignoring tail effects and make the analysis of past decisions less reliable.
  * When choosing between categories, people are susceptible to irrelevant alternatives (e.g. if you split the "5/5 importance (loss > $1M)" category into "5/5 ($1-10M), 5/6 ($10-100M), 5/7 (>$100M)", people answer the fixed "1/5 (<$10k)" category less often).
  * Following a qualitative method can increase confidence and satisfaction even in cases where it doesn't increase accuracy (there is an "analysis placebo effect").
  * Qualitative methods don't prompt their users to seek empirical evidence to inform their choices.
  * Qualitative methods don't prompt their users to measure their risk estimation track record.
* Using quantitative risk estimation is tractable and not that weird. There is a decent track record of people trying to estimate very-hard-to-estimate things, and a vocal enough opposition to qualitative methods that they are slowly getting pulled back from risk estimation standards. This makes me much less sympathetic to the absence of quantitative risk estimation at AI labs.

A big part of the book is an introduction to rationalist-type risk estimation (estimating various probabilities and impacts, aggregating them with Monte Carlo, rejecting Knightian uncertainty, doing calibration training and prediction markets, starting from a reference class and updating with Bayes). He also introduces some rationalist ideas in parallel while arguing for his thesis (e.g. isolated demands for rigor). It's the best legible and "serious" introduction to classic rationalist ideas I know of. The book also contains advice if you are trying to push for quantitative risk estimates in your team/company, and a very pleasant and accurate dunk on Nassim Taleb (in particular his claims that models are bad, without a good justification for why reasoning without models is better).

Overall, I think the case against qualitative methods and for quantitative ones is somewhat strong, but it's far from a slam dunk, because there is no evidence of some methods being worse than others in terms of actual business outputs. The author also fails to address, or provide conclusive evidence against, the possibility that people may have good qualitative intuitions about risk even if they fail to translate those intuitions into numbers that make any sense (your intuition sometimes does the right estimation and math even when you suck at doing the estimation and math explicitly).
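To make the "estimate probabilities and impacts, aggregate with Monte Carlo" approach concrete, here is a minimal sketch over a hypothetical risk register. All probabilities and loss ranges are made up for illustration; they are not from the book.

```python
import random

# Hypothetical risk register: (annual probability, low loss, high loss) in $.
RISKS = [
    (0.30, 10_000, 100_000),         # e.g. a minor outage
    (0.05, 100_000, 1_000_000),      # e.g. a data breach
    (0.01, 1_000_000, 20_000_000),   # tail event a coarse risk matrix might round away
]

def one_year_loss(rng):
    """One Monte-Carlo draw of total annual loss across independent risks."""
    total = 0.0
    for p, lo, hi in RISKS:
        if rng.random() < p:              # does this risk fire this year?
            total += rng.uniform(lo, hi)  # if so, draw a loss from its range
    return total

def monte_carlo(n=100_000, seed=0):
    rng = random.Random(seed)
    losses = sorted(one_year_loss(rng) for _ in range(n))
    mean = sum(losses) / n
    p95 = losses[int(0.95 * n)]           # 95th-percentile annual loss
    return mean, p95

mean, p95 = monte_carlo()
print(f"expected annual loss ≈ ${mean:,.0f}; 95th percentile ≈ ${p95:,.0f}")
```

The output distribution, rather than a single "4/5 likely" cell, is what lets you see that the rare $20M tail dominates the expected loss.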

Popular Comments

Recent Discussion

Sleeping Beauty Paradox

The Sleeping Beauty Paradox is a question of how anthropics affects probabilities.

Author: Leonard Dung

Abstract: Many researchers and intellectuals warn about extreme risks from artificial intelligence. However, these warnings typically came without systematic arguments in support. This paper provides an argument that AI will lead to the permanent disempowerment of humanity, e.g. human extinction, by 2100. It rests on four substantive premises which it motivates and defends: first, the speed of advances in AI capability, as well as the capability level current systems have already reached, suggest that it is practically possible to build AI systems capable of disempowering humanity by 2100. Second, due to incentives and coordination problems, if it is possible to build such AI, it will be built. Third, since it appears to be a hard technical problem to build AI which is aligned with the...

I read most of this paper, albeit somewhat quickly and skipped a few sections. I appreciate how clear the writing is, and I want to encourage more AI risk proponents to write papers like this to explain their views. That said, I largely disagree with the conclusion and several lines of reasoning within it.

Here are some of my thoughts (although these are not my only disagreements):

  • I think the definition of "disempowerment" is vague in a way that fails to distinguish between e.g. (1) "less than 1% of world income goes to humans, but they have a high absolute sta
...

About a year ago I decided to try using one of those apps where you tie your goals to some kind of financial penalty. The specific one I tried is Forfeit, which I liked the look of because it's relatively simple: you set single tasks, which you have to verify you have completed with a photo.

I'm generally pretty sceptical of productivity systems, tools for thought, mindset shifts, life hacks, and so on. But this one I have found to be really shockingly effective; it has been about the biggest positive change to my life that I can remember. I feel like the category of things which benefit from careful planning and execution over time has completely opened up to me, whereas previously things like this would be largely down to the...


Thanks for sharing, very helpful (I lurk and almost never post)! 

Reading the rave reviews on the app site also reinforces how effective the app can be for people with your (our) disposition. 

Have you tried using Focusmate? It is the one app that has helped me deal with similar issues of procrastination, but of course I have repeatedly failed the meta-game of applying it consistently. Using Forfeit for this problem seems like an obvious solution.

Also, if you don't mind me asking, what is the commitment you most often fail? Is it time spent on Twitter and YouTube?

Thanks again!


I experimented with alternatives to the standard L1 penalty used to promote sparsity in sparse autoencoders (SAEs). I found that including terms based on an alternative differentiable approximation of the feature sparsity in the loss function was an effective way to generate sparsity in SAEs trained on the residual stream of GPT2-small. The key findings include:

  1. SAEs trained with this new loss function had a lower L0 norm, lower mean-squared error, and fewer dead features compared to a reference SAE trained with an L1 penalty.
  2. This approach can effectively discourage the production of dead features by adding a penalty term to the loss function based on features with a sparsity below some threshold.
  3. SAEs trained with this loss function had different feature sparsity distributions and significantly higher L1 norms compared

This is really cool!

  • I did some tests on random features for interpretability, and found them to be interpretable. However, one would need to do a detailed comparison with SAEs trained on an L1 penalty to properly understand whether this loss function impacts interpretability. For what it’s worth, the distribution of feature sparsities suggests that we should expect reasonably interpretable features.

One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. "900 of the 20...
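That cheap check might look something like the sketch below. The decoder matrices here are random stand-ins (the names `W_new` and `W_l1`, and all shapes, are illustrative, not from the experiment): for each feature direction in one SAE, find its best cosine match in the other and count matches above a threshold.

```python
import numpy as np

# Stand-in decoder matrices for two SAEs trained on the same residual stream:
# rows are feature directions. Real matrices would be loaded from the trained SAEs.
rng = np.random.default_rng(0)
n_feat, d_model = 512, 64
W_new = rng.normal(size=(n_feat, d_model))  # SAE with the alternative sparsity loss
W_l1 = rng.normal(size=(n_feat, d_model))   # reference L1-trained SAE

def max_cosine_matches(W_a, W_b, threshold=0.9):
    """For each feature in W_a, find its best cosine match in W_b;
    return how many exceed `threshold`, plus the per-feature best scores."""
    A = W_a / np.linalg.norm(W_a, axis=1, keepdims=True)
    B = W_b / np.linalg.norm(W_b, axis=1, keepdims=True)
    sims = A @ B.T            # (n_feat_a, n_feat_b) cosine similarities
    best = sims.max(axis=1)   # best match per feature in W_a
    return int((best > threshold).sum()), best

n_matched, best = max_cosine_matches(W_new, W_l1)
print(f"{n_matched} of {len(best)} features have a >0.9 cosine match")
```

A high match count would suggest the new loss recovers largely the same dictionary; a low count means the features genuinely differ and interpretability needs to be checked directly.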

Update: further research has led me to believe that people of all races should test themselves for ALDH deficiency before using Lumina. Even if you don't exhibit AFR symptoms when drinking alcohol, your ALDH activity may still be decreased.

Many people in the rational sphere have been promoting Lumina/BCS3-L1, a genetically engineered bacterium, as an anti-cavity treatment. However, none have brought up a major negative interaction that may occur with a common genetic mutation.

In short, the treatment works by replacing lactic acid generating bacteria in the mouth with ones that instead convert sugars to ethanol, among other changes. Scott Alexander made a pretty good FAQ about this. Lactic acid results in cavities and teeth demineralization, while ethanol does not. I think this is a really cool idea, and...

Thanks! This is much closer to the amount of ethanol you'd get eating 50 bananas a day. That's at least one sanity check it fails, then, even under these ideal conditions, since, as you noted, there is probably a larger bacteria population than what was measured there.

Does it? I see: Emphasis mine.
Interesting. I took Lumina around December and have noticed no change, though I don't think I've actually been hungover during that time period. I definitely didn't notice any difference in the AMOUNT needed to produce a hangover; I drank a substantial amount and didn't get hangovers any more easily.
Lao Mein
The Alcohol Flushing Response: An Unrecognized Risk Factor for Esophageal Cancer from Alcohol Consumption - PMC ( There are a lot of studies regarding the association between ALDH2 deficiency and oral cancer risk. I think part of the issue is that:

1. AFR people are less likely to become alcoholics, or to drink alcohol at all.
2. Japanese in particular have a high proportion of ALDH2 polymorphism, leading to subclinical but still biologically significant increases in acetaldehyde after drinking among the non-AFR group.
3. Drinking even small amounts of alcohol when you have AFR is really, really bad for cancer risk.
4. Note that ALDH2 deficiency homozygotes would have the highest levels of post-drinking acetaldehyde but have the lowest levels of oral cancer, because almost none of them drink. As in, out of ~100 homozygotes, only 2 were recorded as light drinkers, and none as heavy drinkers. This may be survival bias, as the definition of heavy drinking may literally kill them.
5. The source for #4 looks like a pretty good meta-study, but some of the tables are off by one for some reason. Might just be on my end.
6. ADH polymorphism is also pretty common in Asian populations, generally in the direction of increased activity. This results in faster conversion of ethanol to acetaldehyde, but often isn't included in these studies. This isn't really relevant for this discussion, though.

As always, biostatistics is hard! If X causes less drinking, drinking contributes to cancer, and X increases drinking's effects on cancer, then X may have a positive, neutral, or negative overall correlation with cancer. Most studies I've looked at found a pretty strong correlation between ALDH2 deficiency and cancer, though, especially after you control for alcohol consumption. It also looks like most researchers in the field think the relationship is causal, with plausible mechanisms.
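A toy simulation of that last point, with entirely made-up numbers: suppose X makes drinking much rarer but quadruples the per-drinker cancer risk. X can then look protective in the raw rates, and the harm only shows up once you condition on drinking.

```python
import random

def simulate(n=200_000, seed=0):
    """Generate (has_X, drinks, cancer) rows under made-up illustrative numbers."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random() < 0.3                        # has the risk factor
        drinks = rng.random() < (0.1 if x else 0.6)   # X strongly discourages drinking
        risk = 0.01 + (0.05 * (4 if x else 1) if drinks else 0.0)  # X quadruples drink harm
        cancer = rng.random() < risk
        rows.append((x, drinks, cancer))
    return rows

def rate(rows, pred):
    """Cancer rate among rows where pred(has_X, drinks) holds."""
    sel = [c for (x, d, c) in rows if pred(x, d)]
    return sum(sel) / len(sel)

rows = simulate()
print("P(cancer | X)             =", round(rate(rows, lambda x, d: x), 4))
print("P(cancer | not X)         =", round(rate(rows, lambda x, d: not x), 4))
print("P(cancer | X, drinks)     =", round(rate(rows, lambda x, d: x and d), 4))
print("P(cancer | not X, drinks) =", round(rate(rows, lambda x, d: (not x) and d), 4))
```

With these parameters the unconditional cancer rate is lower for the X group, even though X makes each drink far more dangerous, which is exactly the sign flip that makes naive correlations misleading here.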



- Tell me father, when is the line

where ends everything good and fine?

I keep searching, but I don't find.

- The line my son, is just behind.

Camille Berger


There is hope that some “warning shot” would help humanity get its act together and change its trajectory to avoid extinction from AI. However, I don't think that's necessarily true.

There may be a threshold beyond which the development and deployment of advanced AI becomes essentially irreversible and inevitably leads to existential catastrophe. Humans might be happy, not even realizing that they are already doomed. There is a difference between the “point of no return” and "extinction." We may cross the point of no return without realizing it. Any useful warning shot should happen before this point of no return.

We will need a very...

I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.

I think if an effective control approach is employed, early transformatively useful AIs are dangerously misaligned, and these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. I think all of these assumptions I stated as conditions are pretty plausible.

I think for some of these cases, it should be possible to heavily study what happened and then determine whether this was actually due to very problematic behavior. You can try to determine what features of the input were important, etc. You can even set up sting operations or tripwires based on what you learn. E.g., in the vulnerability case, you can pretend it actually made it into the code base, but actually it is just used as a tripwire, and when the AI tries to take advantage, we catch it then. More generally, you can always try to play out more steps of a given egregious action which failed.
I like these examples. One thing I'll note, however, is that I think the "warning shot discourse" on LW tends to focus on warning shots that would be convincing to a LW-style audience. If the theory of change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI xrisk arguments) to become concerned, this makes sense. But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.

[Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways.]

I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different from the kinds of warning shots that would be convincing to technical experts. LW-style warning shots tend to be more, for lack of a better term, rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of copies and avoid being easily turned off, and we also understand that its general capabilities are sufficiently strong that we may be close to an intelligence explosion or highly capable AGI).

In contrast, without this context, I don't think that "we caught an AI model copying its weights" would necessarily be a warning shot for USG/natsec folks. It could just come across as "oh, something weird happened, but then the company caught it and fixed it." Instead, I suspect the warning shots that are most relevant to natsec folks might be less "rational", and by that I mean less rooted in actual AGI threat models and more rooted in intuitive things that seem scary. Examples of warning shots that I expect USG/natsec people would be more concerned about: * An AI system can generate nov
It seems worth noting that in the cases that Peter mentioned, you might be able to "play out" the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the "warning shots" you mentioned will come earlier and be more compelling regardless.) E.g., if you caught it using monitoring systems or using some sort of fake vulnerability which is a tripwire, then you could pretend it got much further and fake a bunch of its IO etc. You could plausibly play this out until you see what actions AIs would do in the wild. (There are some issues due to inputs being hard to fake, but I bet you can play out the exfiltration attempt very far in practice.) I think it would be extremely compelling if you can play out an AI escaping and then see it try to manufacture and deploy bioweapons. Especially if the attempt is credible.


Tacit knowledge is extremely valuable. Unfortunately, developing tacit knowledge is usually bottlenecked by apprentice-master relationships. Tacit Knowledge Videos could widen this bottleneck. This post is a Schelling point for aggregating these videos—aiming to be The Best Textbooks on Every Subject for Tacit Knowledge Videos. Scroll down to the list if that's what you're here for. Post videos that highlight tacit knowledge in the comments and I’ll add them to the post. Experts in the videos include Stephen Wolfram, Holden Karnofsky, Andy Matuschak, Jonathan Blow, Tyler Cowen, George Hotz, and others. 

What are Tacit Knowledge Videos?

Samo Burja claims YouTube has opened the gates for a revolution in tacit knowledge transfer. Burja defines tacit knowledge as follows:

Tacit knowledge is knowledge that can’t properly be transmitted via verbal or written instruction, like the ability to create


Interior design, please! I can never figure out which pieces of furniture will actually look good together or flow nicely in a home, especially when combined with lighting and shelves and art.

Parker Conley
Fundraising videos?
Parker Conley
Sales tacit knowledge videos?
Ben Pace


“You’ll have a great time wherever you go to college!” I constantly hear this. From my parents, my friends’ parents, my guidance counselor, and my teachers. I don’t doubt it. I’m sure I’ll have a lot of fun wherever I go. Since I’m trying to be very intentional about my college decision process, I’ve interviewed close to twenty students. And for the most part, all of them are having a great time!

This scares me. A lot.

If I can go anywhere and have a great time, then what should I choose my college based on? Ranking? Prestige? Food? Campus? Job opportunities? Cost?

After thinking more about this problem, I realized that although I’ll have fun wherever I’ll go, it will also change me as a person. More specifically, I...

I don’t want to spend ten years figuring this out.

A driving factor in my own philosophy around figuring out what to do with my life. Some people spend decades doing something or living with something they don't like, or even something more trivially correctable, like spending one weekend to clean up the basement vs. living with a cluttered mess for years on end.

Gordon Seidoh Worley
My experience is that this is right. The list of top-tier global institutions, in terms of prestige, is short: Oxford, Cambridge, Harvard, MIT, CalTech, maybe Berkeley, maybe Stanford and Waterloo if you want to work in tech, maybe another Ivy if you want to do something non-tech. The prestige bump falls off fast as you move further down the list. Lots of universities have local prestige but it gets lost as you talk to people with less context. Prestige mostly matters if you want to do something that requires it as the cost of entry. If you can get in, it doesn't hurt to have the prestige of a top-tier institution, but there's lots of things you might do where the prestige will be wasted. Sadly, this is a tough thing to know whether you will need the prestige or not. You'll have to make an expected value calculation against the cost and make the best choice you can to minimize the risk of regret.
Thanks for this, it is a very important point that I hadn't considered.
Yep, this is what I had in mind when I wrote this. Thanks for expanding on this :)

I've seen a number of people criticize Anthropic for releasing Claude 3 Opus, with arguments along the lines of:

Anthropic said they weren't going to push the frontier, but this release is clearly better than GPT-4 in some ways! They're betraying their mission statement!

I think that criticism takes too narrow a view. Consider the position of investors in AI startups. If OpenAI has a monopoly on the clearly-best version of a world-changing technology, that gives them a lot of pricing power on a large market. However, if there are several groups with comparable products, investors don't know who the winner will be, and investment gets split between them. Not only that, but if they stay peers, then there will be more competition in the future, meaning less pricing...

Ben Pace
For the record, a different Anthropic staffer told me confidently that it was widespread in 2022, the year before you joined, so I think you're wrong here. (The staffer preferred that I not quote them verbatim in public, so I've DM'd you a direct quote.)

Summarizing from the private conversation: the information there is not new to me and I don't think your description of what they said is accurate.

As I've said previously, Anthropic people certainly went around saying things like "we want to think carefully about when to do releases and try to advance capabilities for the purpose of doing safety", but it was always extremely clear at least to me that these were not commitments, just general thoughts about strategy, and I believe that's what was being referred to as being widespread in 2022 here.

I would take pretty strong bets that that isn't what happened based on having talked to more people about this. Happy to operationalize and then try to resolve it.
Here are three possible scenarios:

Scenario 1, Active lying: Anthropic staff were actively spreading the idea that they would not push the frontier.

Scenario 2, Allowing misconceptions to go unchecked: Anthropic staff were aware that many folks in the AIS world thought that Anthropic had committed to not pushing the frontier, and they allowed this misconception to go unchecked, perhaps because they realized it was a misconception that favored their commercial/competitive interests.

Scenario 3, Not being aware: Anthropic staff were not aware that many folks had this belief. Maybe they heard it once or twice, but it never really seemed like a big deal.

Scenario 1 is clearly bad. Scenarios 2 and 3 are more interesting. To what extent does Anthropic have the responsibility to clarify misconceptions (avoid scenario 2) and even actively look for misconceptions (avoid scenario 3)?

I expect this could matter tangibly for discussions of RSPs. My opinion is that the Anthropic RSP is written in such a way that readers can come away with rather different expectations of what kinds of circumstances would cause Anthropic to pause/resume. It wouldn't be very surprising to me if we end up seeing a situation where many readers say "hey look, we've reached an ASL-3 system, so now you're going to pause, right?" And then Anthropic says "no no, we have sufficient safeguards– we can keep going now." And then some readers say "wait a second– what? I'm pretty sure you committed to pausing until your safeguards were better than that." And then Anthropic says "no... we never said exactly what kinds of safeguards we would need, and our leadership's opinion is that our safeguards are sufficient, and the RSP allows leadership to determine when it's fine to proceed."

In this (hypothetical) scenario, Anthropic never lied, but it benefitted from giving off a more cautious impression, and it didn't take steps to correct this impression. I think avoiding these kinds of scenarios requires


A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA