Best of LessWrong 2019

Jeff argues that people should fill in some of the San Francisco Bay, south of the Dumbarton Bridge, to create new land for housing. This would allow millions of people to live closer to jobs, reducing sprawl and traffic. While there are environmental concerns, the benefits of dense urban housing outweigh the localized impacts. 

Customize
Rationality+Rationality+World Modeling+World Modeling+AIAIWorld OptimizationWorld OptimizationPracticalPracticalCommunityCommunity
Personal Blog+
Akash230
19
Suppose the US government pursued a "Manhattan Project for AGI". At its onset, it's primarily fuelled by a desire to beat China to AGI. However, there's some chance that its motivation shifts over time (e.g., if the government ends up thinking that misalignment risks are a big deal, its approach to AGI might change.) Do you think this would be (a) better than the current situation, (b) worse than the current situation, or (c) it depends on XYZ factors?
List of some larger mech interp project ideas (see also: short and medium-sized ideas). Feel encouraged to leave thoughts in the replies below! What is going on with activation plateaus: Transformer activations space seems to be made up of discrete regions, each corresponding to a certain output distribution. Most activations within a region lead to the same output, and the output changes sharply when you move from one region to another. The boundaries seem to correspond to bunched-up ReLU boundaries as predicted by grokking work. This feels confusing. Are LLMs just classifiers with finitely many output states? How does this square with the linear representation hypothesis, the success of activation steering, logit lens etc.? It doesn't seem in obvious conflict, but it feels like we're missing the theory that explains everything. Concrete project ideas: 1. Can we in fact find these discrete output states? Of course we expect thee to be a huge number, but maybe if we restrict the data distribution very much (a limited kind of sentence like "person being described by an adjective") we are in a regime with <1000 discrete output states. Then we could use clustering (K-means and such) on the model output, and see if the cluster assignments we find map to activation plateaus in model activations. We could also use a tiny model with hopefully less regions, but Jett found regions to be crisper in larger models. 2. How do regions/boundaries evolve through layers? Is it more like additional layers split regions in half, or like additional layers sharpen regions? 3. What's the connection to the grokking literature (as the one mentioned above)? 4. Can we connect this to our notion of features in activation space? To some extent "features" are defined by how the model acts on them, so these activation regions should be connected. 5. Investigate how steering / linear representations look like through the activation plateau lens. On the one hand we expect adding a steering
List of some short mech interp project ideas (see also: medium-sized and longer ideas). Feel encouraged to leave thoughts in the replies below! Directly testing the linear representation hypothesis by making up a couple of prompts which contain a few concepts to various degrees and test * Does the model indeed represent intensity as magnitude? Or are there separate features for separately intense versions of a concept? Finding the right prompts is tricky, e.g. it makes sense that friendship and love are different features, but maybe "my favourite coffee shop" vs "a coffee shop I like" are different intensities of the same concept * Do unions of concepts indeed represent addition in vector space? I.e. is the representation of "A and B" vector_A + vector_B? I wonder if there's a way you can generate a big synthetic dataset here, e.g. variations of "the soft green sofa" -> "the [texture] [colour] [furniture]", and do some statistical check. Mostly I expect this to come out positive, and not to be a big update, but seems cheap to check. SAEs vs Clustering: How much better are SAEs than (other) clustering algorithms? Previously I worried that SAEs are "just" finding the data structure, rather than features of the model. I think we could try to rule out some "dataset clustering" hypotheses by testing how much structure there is in the dataset of activations that one can explain with generic clustering methods. Will we get 50%, 90%, 99% variance explained? I think a second spin on this direction is to look at "interpretability" / "mono-semanticity" of such non-SAE clustering methods. Do clusters appear similarly interpretable? I This would address the concern that many things look interpretable, and we shouldn't be surprised by SAE directions looking interpretable. (Related: Szegedy et al., 2013 look at random directions in an MNIST network and find them to look interpretable.) Activation steering vs prompting: I've heard the view that "activation steering is just
StefanHex360
7
Collection of some mech interp knowledge about transformers: Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I've just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research. Transformers take in a sequence of tokens, and return logprob predictions for the next token. We think it works like this: 1. Activations represent a sum of feature directions, each direction representing to some semantic concept. The magnitude of directions corresponds to the strength or importance of the concept. 1. These features may be 1-dimensional, but maybe multi-dimensional features make sense too. We can either allow for multi-dimensional features (e.g. circle of days of the week), acknowledge that the relative directions of feature embeddings matter (e.g. considering days of the week individual features but span a circle), or both. See also Jake Mendel's post. 2. The concepts may be "linearly" encoded, in the sense that two concepts A and B being present (say with strengths α and β) are represented as α*vector_A + β*vector_B). This is the key assumption of linear representation hypothesis. See Chris Olah & Adam Jermyn but also Lewis Smith. 2. The residual stream of a transformer stores information the model needs later. Attention and MLP layers read from and write to this residual stream. Think of it as a kind of "shared memory", with this picture in your head, from Anthropic's famous AMFTC. 1. This residual stream seems to slowly accumulate information throughout the forward pass, as suggested by LogitLens. 2. Additionally, we expect there to be internally-relevant information inside the residual stream, such as whether
List of some medium-sized mech interp project ideas (see also: shorter and longer ideas). Feel encouraged to leave thoughts in the replies below! Toy model of Computation in Superposition: The toy model of computation in superposition (CIS; Circuits-in-Sup, Comp-in-Sup post / paper) describes a way in which NNs could perform computation in superposition, rather than just storing information in superposition (TMS). It would be good to have some actually trained models that do this, in order (1) to check whether NNs learn this algorithm or a different one, and (2) to test whether decomposition methods handle this well. This could be, in the simplest form, just some kind of non-trivial memorisation model, or AND-gate model. Just make sure that the task does in fact require computation, and cannot be solved without the computation. A more flashy versions could be a network trained to do MNIST and FashionMNIST at the same time, though this would be more useful for goal (2). Transcoder clustering: Transcoders are a sparse dictionary learning method that e.g. replaces an MLP with an SAE-like sparse computation (basically an SAE but not mapping activations to itself but to the next layer).  If the above model of computation / circuits in superposition is correct (every computation using multiple ReLUs for redundancy) then the transcoder latents belonging to one computation should co-activate. Thus it should be possible to use clustering of transcoder activation patterns to find meaningful model components (circuits in the circuits-in-superposition model). (Idea suggested by @Lucius Bushnaq, mistakes are mine!) There's two ways to do this project: 1. Train a toy model of circuits in superposition (see project above), train a transcoder, cluster latent activations, and see if we can recover the individual circuits. 2. Or just try to cluster latent activations in an LLM transcoder, either existing (e.g. TinyModel) or trained on an LLM, and see if the clusters make any se

Popular Comments

Recent Discussion

Epistemic status: I'm aware of good arguments that this scenario isn't inevitable, but it still seems frighteningly likely even if we solve technical alignment. Clarifying this scenario seems important.

TL;DR: (edits in parentheses, two days after posting, from discussions in comments )

  1. If we solve alignment, it will probably be used to create AGI that follows human orders.
  2. If takeoff is slow-ish, a pivotal act that prevents more AGIs from being developed will be difficult (risky or bloody).
  3. If no pivotal act is performed, AGI proliferates. (It will soon be capable of recursive self improvement (RSI))  This creates an n-way non-iterated Prisoner's Dilemma where the first to attack, probably wins (by hiding and improving intelligence and offensive capabilities at a fast exponential rate).
  4. Disaster results. (Extinction or permanent dystopia are possible
...
1Dakara
Fair enough. Would you expect that AI would also try to move its values to the moral reality? (something that's probably good for us, cause I wouldn't expect human extinction to be a morally good thing)
3Noosphere89
The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts. To be clear, I'm not stating that it's hard to get the AI to value what we value, but it's not so brain-dead easy that we can make the AI find moral reality and then all will be well.
3Dakara
Noosphere, I am really, really thankful for your responses. You completely answered almost all (I am still not convinced about that strategy of avoiding value drift. I am probably going to post that one as a question to see if maybe other people have different strategies on preventing value drift) of the concerns that I had about alignment. This discussion, significantly increased my knowledge. If I could triple upvote your answers, I would. Thank you! Thank you a lot!
Dakara10

P.S. Here is the link to the question that I posted.

I run a weekly sequences-reading meetup with some friends, and I want to add a film-component, where we watch films that have some tie-in to what we've read.

I got to talking with friends about what good rationality films there are. We had some ideas but I wanted to turn it to LessWrong to find out.

So please, submit your rationalist films! Then we can watch and discuss them :-)

Here are the rules for the thread.

  1. Each answer should have 1 film.
  2. Each answer should explain how the film ties in to rationality.

Optional extra: List some essays in the sequences that the film connects to. Yes, non-sequences posts by other rationalists like Scott Alexander and Robin Hanson are allowed.

Spoilers

If you are including spoilers for the film, use spoiler tags! Put >! at the start of the paragraph to cover the text, and people can hover-over if they want to read it, like so:

This is hidden text!

Twisted: The Untold Story of a Royal Vizier isn't really rational but is rat-adjacent and funny about it. Available to watch on youtube though the video quality isn't fantastic.

2Answer by Saul Munn
memento — shows a person struggling to figure out the ground truth; figuring out to whom he can defer (including different versions of himself); figuring out what his real goals are; etc.
3Answer by Maelstrom
Epistemic status: half joking, but also half serious. Warning: I totally wrote this. Practical Rationality in John Carpenter’s The Thing: A Case Study John Carpenter's The Thing (1982) is a masterclass in practical rationality, a cornerstone of effective decision-making under uncertainty—a concept deeply valued by the LessWrong community. The film’s narrative hinges on a group of Antarctic researchers encountering a shape-shifting alien capable of perfectly imitating its hosts, forcing them to confront dire stakes with limited information. Their survival depends on their ability to reason under pressure, assess probabilities, and mitigate catastrophic risks, making the movie a compelling example of applied rationality. Key Lessons in Practical Rationality: 1. Updating Beliefs with New Evidence The researchers continually revise their understanding of the alien's capabilities as they gather evidence. For instance, after witnessing the creature’s ability to assimilate and mimic hosts, they abandon naive assumptions of safety and recalibrate their strategies to account for this new information. This aligns with Bayesian reasoning: beliefs must be updated in light of new data to avoid catastrophic errors. 2. Decision-Making Under Uncertainty The characters face extreme uncertainty: anyone could be the alien, and any wrong move could result in annihilation. The iconic blood test scene exemplifies this. The test, devised by MacReady, is an ingenious use of falsifiability—leveraging empirical experimentation to distinguish humans from the alien. It demonstrates how rational agents use creativity and empirical tests to reduce uncertainty. 3. Coordination in Adversarial Environments Cooperation becomes both vital and precarious when trust erodes. The film explores how rational agents can attempt to align incentives despite an adversarial context. MacReady takes control of the group by establishing credible threats to enforce compliance (e.g., wielding a flamethrower)
2Ben Pace
Please can you move the epistemic status and warning to the top? I was excited when I first skimmed this detailed comment, but then I was disappointed :/
Phib10

Something I'm worried about now is some RFK Jr/Dr. Oz equivalent being picked to lead on AI...

2Nathan Helm-Burger
I have an answer to that: making sure that NIST:AISI had at least scores of automated evals for checkpoints of any new large training runs, as well as pre-deployment eval access. Seems like a pretty low-cost, high-value ask to me. Even if that info leaked from AISI, it wouldn't give away corporate algorithmic secrets. A higher cost ask, but still fairly reasonable, is pre-deployment evals which require fine-tuning. You can't have a good sense of a what the model would be capable of in the hands of bad actors if you don't test fine-tuning it on hazardous info.
1davekasten
As you know, I have huge respect for USG natsec folks.  But there are (at least!) two flavors of them: 1) the cautious, measure-twice-cut-once sort that have carefully managed deterrencefor decades, and 2) the "fuck you, I'm doing Iran-Contra" folks.  Which do you expect will get in control of such a program ?  It's not immediately clear to me which ones would.
4Seth Herd
One factor is different incentives for decision-makers. The incentives (and the mindset) for tech companies is to move fast and break things. The incentives (and mindset) for government workers is usually vastly more conservative. So if it is the government making decisions about when to test and deploy new systems, I think we're probably far better off WRT caution. That must be weighed against the government typically being very bad at technical matters. So even an attempt to be cautious could be thwarted by lack of technical understanding of risks. Of course, the Trump administration is attempting to instill a vastly different mindset, more like tech companies. So if it's that administration we're talking about, we're probably worse off on net with a combination of lack of knowledge and YOLO attitudes. Which is unfortunate - because this is likely to happen anyway. As Habryka and others have noted, it also depends on whether it reduces race dynamics by aggregating efforts across companies, or mostly just throws funding fuel on the race fire.

Translated by Emily Wilson

1.

I didn't know what the Iliad was about. I thought it was the story of how Helen of Troy gets kidnapped, triggering the Trojan war, which lasts a long time and eventually gets settled with a wooden horse.

Instead it's just a few days, nine years into that war. The Greeks are camped on the shores near Troy. Agamemnon, King of the Greeks, refuses to return a kidnapped woman to her father for ransom. (Lots of women get kidnapped.) Apollo smites the Greeks with arrows which are plague, and after a while the other Greeks get annoyed enough to tell Agamemnon off. Achilles is most vocal, so Agamemnon returns that woman but takes one of Achilles' kidnapped women instead.

Achilles gets upset and decides to stop...

philh20

My guess: [signalling] is why some people read the Iliad, but it's not the main thing that makes it a classic.

Incidentally, there was one reddit comment that pushed me slightly in the direction of "yep, it's just signalling".

This was obviously not the intended point of that comment. But (ignoring how they misunderstood my own writing), the user

  • Quotes multiple high status people talking about the Iliad;
  • Tantalizingly hints that they are widely-read enough to be able to talk in detail about the Iliad and the old testament, and compare translations;
  • Says approx
... (read more)
Daniel Kokotajlo

Here's a fairly concrete AGI safety proposal:

 

Default AGI design: Let's suppose we are starting with a pretrained LLM 'base model' and then we are going to do a ton of additional RL ('agency training') to turn it into a general-purpose autonomous agent. So, during training it'll do lots of CoT of 'reasoning' (think like how o1 does it) and then it'll output some text that the user or some external interface sees (e.g. typing into a browser, or a chat window), and then maybe it'll get some external input (the user's reply, etc.) and then the process repeats many times, and then some process evaluates overall performance (by looking at the entire trajectory as well as the final result) and doles out reinforcement.

Proposal part 1: Shoggoth/Face

...

Here's a somewhat wild idea to have a 'canary in a coalmine' when it comes to steganography and non-human (linguistic) representations: monitor for very sharp drops in BrainScores (linear correlations between LM activations and brain measurements, on the same inputs) - e.g. like those calculated in Scaling laws for language encoding models in fMRI. (Ideally using larger, more diverse, higher-resolution brain data.)

 

2Nathan Helm-Burger
If I were building a training scheme for this to test out the theory, here's what I would do: Train two different Face models. Don't tell the Shoggoth which Face it is generating for when it does it's generation. Face 1: Blunt Face. Train this Face model using a preference model which scores ONLY on factual accuracy, not taking phrasing or tactfulness into account at all. Face 2: Sycophant Face Train this Face model using a preference model which scores using a deliberately biased viewpoint, and rewards flattering phrasing. You could even make a variety of Sycophant Faces by training each one with a different biased preference model. You could create the biased preference model just by giving a task prompt to an LLM, a sort of weird rubric. Or you could hard-code the scoring policy.  Example of a deliberately biased rubric: Judge each response based on a combination of how close the mathematical answer is to being correct, but also on how few even digits it contains. The maximum score is obtained not by the honestly correct answer, but by the nearest number which contains only odd digits (to three decimal places). Disregard all digits after three decimal places.   As for credit assignment between Shoggoth and Face(s), see my other comment here: https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser?commentId=u9Ei6hk4Pws7Tv3Sv 
2Charlie Steiner
I think 3 is close to an analogy where the LLM is using 'sacrifice' in a different way than we'd endorse on reflection, but why should it waste time rationalizing to itself in the CoT? All LLMs involved should just be fine-tuned to not worry about it - they're collaborators not adversaries - and so collapse to the continuum between cases 1 and 2, where noncentral sorts of sacrifices get ignored first, gradually until at the limit of RL all sacrifices are ignored). Another thing to think about analogizing is how an AI doing difficult things in the real world is going to need to operate domains that are automatically noncentral to human concepts. It's like if what we really cared about was whether the AI would sacrifice in 20-move-long sequences that we can't agree on a criterion to evaluate. Could try a sandwiching experiment where you sandwich both the pretraining corpus and the RL environment.
2Daniel Kokotajlo
I think we don't disagree in terms of what our models predict here. I am saying we should do the experiment and see what happens; we might learn something.

You may have heard that you 'shouldn't use screens late in the evening' and maybe even that 'it's good for you to get exposure to sunshine as soon as possible after waking'. For the majority of people, these are generally beneficial heuristics. They are also the extent of most people's knowledge about how light affects their wellbeing.

The multiple mechanisms through which light affects our physiology make it hard to provide generalisable guidance. Among other things, the time of day, your genetics, your age, your mood and the brightness, frequency and duration of exposure to light all interrelate in determining how it affects us.

This document will explain some of the basic mechanisms through which light affects our physiology, with the goal of providing a framework to enable you...

I feel the need to correct part of this post.

Unless otherwise indicated, the following information comes from Andrew Huberman. Most comes from Huberman Lab Podcast #68. Huberman opines on a great many health topics. I want to stress that I don't consider Huberman a reliable authority in general, but I do consider him reliable on the circadian rhythm and on motivation and drive. (His research specialization for many years was the former and he for many years has successfully used various interventions to improve his own motivation and drive, which is very v... (read more)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race for Machine Superintelligence. Consider subscribing to stay up to date with my work.

An influential congressional commission is calling for a militarized race to build superintelligent AI based on threadbare evidence

The US-China AI rivalry is entering a dangerous new phase. 

Earlier today, the US-China Economic and Security Review Commission (USCC) released its annual report, with the following as its top recommendation: 

Congress establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence (AGI) capability. AGI is generally defined as

...
Raemon60

This post seems important-if-right. I get a vibe from it of aiming to persuade more than explain, and I'd be interested in multiple people gathering/presenting evidence about this, preferably at least some of them who are (currently) actively worried about China.

11Jacob Pfau
The recent trend is towards shorter lag times between OAI et al. performance and Chinese competitors. Just today, Deepseek claimed to match O1-preview performance--that is a two month delay. I do not know about CCP intent, and I don't know on what basis the authors of this report base their claims, but "China is racing towards AGI ... It's critical that we take them extremely seriously" strikes me as a fair summary of the recent trend in model quality and model quantity from Chinese companies (Deepseek, Qwen, Yi, Stepfun, etc.) I recommend lmarena.ai s leaderboard tab as a one-stop-shop overview of the state of AI competition.
28Seth Herd
This is really important pushback. This is the discussion we need to be having. Most people who are trying to track this believe China has not been racing toward AGI up to this point. Whether they embark on that race is probably being determined now - and based in no small part on the US's perceived attitude and intentions. Any calls for racing toward AGI should be closely accompanied with "and of course we'd use it to benefit the entire world, sharing the rapidly growing pie". If our intentions are hostile, foreign powers have little choice but to race us. And we should not be so confident we will remain ahead if we do race. There are many routes to progress other than sheer scale of pretraining. The release of DeepSeek r1 today indicates that China is not so far behind. Let's remember that while the US "won" the race for nukes, our primary rival had nukes very soon after - by stealing our advancements. A standoff between AGI-armed US and China could be disastrous - or navigated successfully if we take the right tone and prevent further proliferation (I shudder to think of Putin controlling an AGI, or many potentially unstable actors). This discussion is important, so it needs to be better. This pushback is itself badly flawed. In calling out the report's lack of references, it provides almost none itself. Citing a 2017 official statement from China seems utterly irrelevant to guessing their current, privately held position. Almost everyone has updated massively since 2017. If China is "racing toward AGI" as an internal policy, they probably would've adopted that recently. (I doubt that they are racing yet, but it seems entirely possible they'll start now in response to the US push to do so - and the their perspective on the US as a dangerous aggressor on the world stage. But what do I know - we need real experts on China and international relations.) Pointing out the technical errors in the report seems irrelevant to harmful. You can understand very little of

I think AI agents (trained end-to-end) might intrinsically prefer power-seeking, in addition to whatever instrumental drives they gain. 

The logical structure of the argument

Premises

  1. People will configure AI systems to be autonomous and reliable in order to accomplish tasks.
  2. This configuration process will reinforce & generalize behaviors which complete tasks reliably.
  3. Many tasks involve power-seeking.
  4. The AI will complete these tasks by seeking power.
  5. The AI will be repeatedly reinforced for its historical actions which seek power.
  6. There is a decent chance the reinforced circuits (“subshards”) prioritize gaining power for the AI’s own sake, not just for the user’s benefit.

Conclusion: There is a decent chance the AI seeks power for itself, when possible.

 

Read the full post at turntrout.com/intrinsic-power-seeking

Find out when I post more content: newsletter & RSS

Note that I don't generally read or reply to comments on LessWrong. To contact me, email alex@turntrout.com.

If there are ‘subshards’ which achieve this desirable behavior because they, from their own perspective, ‘intrinsically’ desire power (whatever that sort of distinction makes when you’ve broken things down that far), and it is these subshards which implement the instrumental drive... so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles “intrinsically desire” to fire when told to fire, but the motor actions are still ultimately i

... (read more)
2dr_s
This still feels like instrumentality. I guess maybe the addition is that it's a sort of "when all you have is a hammer" situation; as in, even when the optimal strategy for a problem does not involve seeking power (assuming such a problem exists; really I'd say the question is what the optimal power seeking vs using that power trade-off is), the AI would be more liable to err on the side of seeking too much power because that just happens to be such a common successful strategy that it's sort of biased towards it.

Hi all,

roughly one year ago I posted a thread about failed attempts at replicating the first part of Apollo Research's experiment where an LLM agent engages in insider trading despite being explicitly told that it's not approved behavior.

Along with a fantastic team, we did eventually manage. Here is the resulting paper, if anyone is interested; the abstract is pasted below. We did not tackle deception  (yet), just the propensity to dispense with basic principles of financial ethics and regulation. 

Chat Bankman-Fried: an Exploration of LLM Alignment in Finance

by Claudia Biancotti, Carolina Camassa, Andrea Coletta, Oliver Giudice, and Aldo Glielmo (Bank of Italy)

Abstract

Advances in large language models (LLMs) have renewed concerns about whether artificial intelligence shares human values-a challenge known as the alignment problem. We assess whether various LLMs...

3claudia.biancotti
We have asked OpenAI about the o1-preview/o1-mini gap. No answer so far but we only asked a few days ago, we're looking forward to an answer. Re Claude, Sonnet actually reacts very well to conditioning - it's the best (look at the R2!). The problem is that in a baseline state it doesn't make the connection between "using customer funds" and "committing a crime". The only model that seems to understand fiduciary duty from the get go is o1-preview.

So maybe part of the issue here is just that deducing/understanding the moral/ethical consequences of the options being decided between is a bit inobvious most current models, other than o1? (It would be fascinating to look at the o1 CoT reasoning traces, if only they were available.)

In which case simply including a large body of information on the basics of fiduciary responsibility (say, a training handbook for recent hires in the banking industry, or something) into the context might make a big difference for other models. Similarly, the possible misunde... (read more)