Best of LessWrong 2021

Eliezer Yudkowsky recently criticized the OpenPhil draft report on AI timelines. Holden Karnofsky thinks Eliezer misunderstood the report in important ways, and defends the report's usefulness as a tool for informing (not determining) AI timelines.

Mark Xu
Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research's recent work on AI control really "hits it out of the park", and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI-control-type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you're working on alignment and reading this, I think there's probably a ~45% chance it would be better for your values if you instead were working on AI control!

For this post, my definitions are roughly:

* AI alignment is the task of ensuring the AIs "do what you want them to do".
* AI control is the task of ensuring that if the AIs are not aligned (e.g. don't always "do what you want" and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research).

Here are some thoughts, arguments, and analogies (epistemic status: there is no "hidden content"; if you don't find the literal words I wrote persuasive you shouldn't update. In particular, just update on the words and don't update on what my words imply about my beliefs.):

* Everything is in degrees. We can "partially align" some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is "how aligned are the AIs" + "how aligned do they need to be to use them for stuff", so we should also focus on the 2nd thing.
* If you were a hedge fund, and your strategy for preventing people from stealing your data and starting a new hedge fund was "we will make the hedge fund a super fun place to work and interview people carefull
Dan Braun
In which worlds would AI Control prevent significant harm?

When I bring up the issue of AI model security to people working in AI safety, I'm often met with something of the form "yes, this is a problem. It's important that people work hard on securing AI models. But it doesn't really affect my work." Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why it might not be as effective an agenda as one might think after considering the realities of our cybersecurity situation.

1. AI Control concerns itself with models that intentionally try to subvert their developers.
2. These models are likely to be very generally capable and capable of causing significant harm without countermeasures.
3. Leading cyber-capable institutions would likely expend significant resources and political capital to steal these models or steal enough insights to reproduce such models.
4. If the weights or insights are stolen, work on AI control will not prevent these models from causing significant harm.
5. Current AI developers are not on track to be able to defend against high-priority operations from leading cyber-capable institutions in the coming years.
6. Therefore, AI control will only be useful in the coming years under one (or more) of these conditions:
   1. Models that scheme are unlikely to be generally capable/dangerous enough to be a high-priority target for leading cyber-capable institutions.
   2. Models that scheme are only developed by actors that can thwart high-priority operations from leading cyber-capable institutions (which precludes current AI developers for at least several years).
   3. AI Control won't be directly useful in the coming years, but it will be indirectly useful to progress the field for when models are developed by actors capable of thwarting top cyber operations.
   4. Even if the model was stolen and caused significant harm, there would still be less harm overall than if the mode
I get the sense that we can't trust Open Philanthropy to do a good job on AI safety, and this is a big problem. Many people would have more useful things to say about this than I do, but I still feel that I should say something. My sense comes from:

* Open Phil is reluctant to do anything to stop the companies that are doing very bad things to accelerate the likely extinction of humanity, and is reluctant to fund anyone who's trying to do anything about it.
* People at Open Phil have connections with people at Anthropic, a company that's accelerating AGI and has a track record of (plausibly-deniable) dishonesty. Dustin Moskovitz has money invested in Anthropic, and Open Phil employees might also stand to make money from accelerating AGI. And I agree with Bryan Caplan's recent take that friendships are often a bigger conflict of interest than money, so Open Phil higher-ups being friends with Anthropic higher-ups is troubling.

A lot of people (including me as of ~one year ago) consider Open Phil the gold standard for EA-style analysis. I think Open Phil is actually quite untrustworthy on AI safety (but probably still good on other causes). I don't know what to do with this information.
Recursive self-improvement in AI probably comes before AGI. Evolution doesn't need to understand human minds to build them, and a parent doesn't need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn't depend on understanding how they think. Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven mundane way that doesn't depend on matching capabilities of Grothendieck for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, and the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.
Fabien Roger
I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:

* Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit, because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
* Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)

(The other classic ways in which cyber transfers to preventing takeovers are mostly about how, when you prevent spies from stealing model weights or using models in unauthorized ways, you naturally create infra that also makes it hard for AIs to self-exfiltrate or run unmonitored.)

It was also fun to see arguments and drama that are close in kind to the arguments about the best ways to prevent AI catastrophes: people in cybersecurity argue about whether focusing on vulnerabilities is right or whether it's better to focus on the human element, whether the focus on strong passwords is right, whether some solutions are too bothersome/costly to be used in practice, whether imposing specific cybersecurity standards is a good idea, ... It made me realize how niche most AI safety arguments must look to people o

Recent Discussion

This is a project submission post for the AI Safety Fundamentals course from BlueDot Impact. Some of its sections (mainly the Introduction) are intended to be beginner-friendly and will read as overly verbose to familiar readers, who may freely skip them.

TLDR (Executive Summary)

  • We explored whether Sparse Autoencoders (SAEs) can effectively transfer from base language models to their finetuned counterparts, focusing on two base models: Gemma-2b and Mistral-7B-V0.1 (we tested finetuned versions for coding and mathematics respectively)
  • In particular, we split our analysis into three steps:
    1. We analysed the similarity (cosine and Euclidean distance) of the residual activations, which was highly correlated with the resulting transferability of the SAEs for the two models (see the sketch below).
    2. We computed several performance metrics (L0 Loss, Reconstruction CE Loss, Variance Explained) of the base SAEs on the fine-tuned
...
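
As a concrete illustration (not the authors' code; tensor names and shapes are assumed), the residual-activation similarity from step 1 and the L0 metric from step 2 could be computed along these lines:

```python
# Minimal sketch, assuming `base_acts` and `ft_acts` are [n_tokens, d_model]
# residual-stream activations collected at the same layer of the base and
# finetuned models on the same tokens (names and shapes are illustrative).
import torch

def activation_similarity(base_acts: torch.Tensor, ft_acts: torch.Tensor):
    """Mean per-token cosine similarity and Euclidean distance."""
    cos = torch.nn.functional.cosine_similarity(base_acts, ft_acts, dim=-1)
    dist = torch.linalg.vector_norm(base_acts - ft_acts, dim=-1)
    return cos.mean().item(), dist.mean().item()

def sae_l0(feature_acts: torch.Tensor) -> float:
    """L0 metric: average number of active (non-zero) SAE features per token."""
    return (feature_acts != 0).float().sum(dim=-1).mean().item()

# Example with random placeholders standing in for real activations:
base_acts, ft_acts = torch.randn(128, 2048), torch.randn(128, 2048)
mean_cos, mean_dist = activation_similarity(base_acts, ft_acts)
print(f"cosine: {mean_cos:.3f}, euclidean: {mean_dist:.3f}")
```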

Thanks! We'll take a closer look at these when we decide to extend our results to more models.

Taras Kutsyk
Let me make sure I understand your idea correctly:

1. We use a separate single-layer model (analogous to the SAE encoder) to predict the SAE feature activations.
2. We train this model on the SAE activations of the finetuned model (assuming that the SAE wasn't finetuned on the finetuned model activations?).
3. We then use this model to determine "what direction most closely maps to the activation pattern across input sequences, and how well it maps".

I'm most unsure about the 2nd step - how we train this feature-activation model. If we train it on the base SAE activations in the finetuned model, I'm afraid we'll just train it on extremely noisy data, because feature activations essentially do not mean the same thing, unless your SAE has been finetuned to appropriately reconstruct the finetuned model activations. (And if we finetune it, we might just as well use the SAE and feature-universality techniques I outlined without needing a separate model.)
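
A minimal sketch of the single-layer "feature-activation model" described in step 1, with assumed dimensions and placeholder data; whether the target activations should come from the base SAE or a finetuned SAE is exactly the open question in the comment above:

```python
# Minimal sketch of a "feature-activation model" (all shapes and data illustrative).
import torch
import torch.nn as nn

d_model, n_features = 2048, 16384  # assumed dimensions

class FeatureActivationProbe(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # analogous to an SAE encoder

    def forward(self, resid_acts: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(resid_acts))

probe = FeatureActivationProbe(d_model, n_features)
opt = torch.optim.Adam(probe.parameters(), lr=1e-4)

# One training step: `ft_resid` stands in for finetuned-model residual activations,
# `target_feature_acts` for the (possibly noisy) SAE feature activations on the
# same tokens - the quality of these targets is the crux discussed above.
ft_resid = torch.randn(64, d_model)
target_feature_acts = torch.relu(torch.randn(64, n_features))
loss = nn.functional.mse_loss(probe(ft_resid), target_feature_acts)
loss.backward()
opt.step()
opt.zero_grad()
```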

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

Bohaska

If spaced repetition is the most efficient way of remembering information, why do people who learn a musical instrument practice every day instead of adhering to a spaced repetition schedule?

It's that time of year - the time when rationality seems increasingly scarce as political tensions rise.  I find myself wishing I could have one of the people I see reaching super different conclusions shoot me with a POV gun so I could understand what it's like being on the other side.  
I'm not strongly left-leaning, so I don't have trouble understanding why people may have some concerns about the left - but I have 0% support for Donald Trump, so if you want to explain to me why you think he's great, go for it.  I also think that the election is close to 50/50 currently, so if you think it's 80+/20- either way, I'm also interested in hearing from you.
 

2 notes:
1. I really wish I...

Answer by deepthoughtlife
I'm a lifelong independent centrist who leans clearly Republican in voting despite my nature. I actually mostly read Dem or Dem-leaning sources for the majority of my political reading (though it isn't super skewed and I do make sure to seek out both sides). I would definitely vote for a Democrat that seemed like a good candidate with good policies (and have at the state level). I believe it is my duty as an American citizen to know a lot more about politics than I would prefer. (I kind of hate politics.)

I really wished there was a valid third option in 2016, but unfortunately I couldn't even find a third-party candidate that seemed better than Trump even then. Hillary was a truly abysmal candidate. That isn't actually enough to get me to vote for someone, and I would rather do a protest vote than vote for someone that would be a bad choice. In the end, I only decided to vote for Trump two weeks before the election instead of a protest vote.

Due to preexisting animus, I probably would have ignored Trump's actual words and deeds on the 2016 campaign trail if the media hadn't constantly lied about them, but the things he said were both much truer than claimed and actually made a lot of sense. The media has never stopped lying about Trump since then, but they just shredded their credibility with me and many others. I don't recall the sources, but unless I am remembering incorrectly (which is always a possibility), this lying campaign against Trump was directly suggested through op-eds in major newspapers, and then implemented.

You can probably make good points against Trump, but no one seems to actually complain about things that are both true and actually bad? (I am very pedantic about truth, and find that I'm not interested in listening to people who twist things and then pretend they are true.)

I deeply want to change and improve things, and chafe at the idea of being restricted to some old formula for life, but I've been forced to realize that conservatism is neces
Pazzaz
People often say one of the reasons they won't vote for Trump is his attempts to overturn the results of the 2020 election. What is your view on that?
deepthoughtlife
I don't think people believe that asking the legal system to rule on whether the laws were properly followed is somehow disqualifying, so unless I am mistaken about what they are claiming, it didn't happen in any meaningful way. The media has intentionally misrepresented this. He believed there was cheating from the other side, and said so. He used the normal methods to complain about that, and the normal lawsuits about it to get the court to rule on the matter. It's all very normal. When the courts decided not to consider the matter (which was itself improper, since they generally did that without considering the actual merits of the cases, generally claiming that it was somehow moot because the election was already over), he did nothing and just let his opponent become president (while continuing to vociferously complain).

Both Al Gore and Hillary Clinton made roughly the same level of complaint about the result as Trump. (Since I think both of them are terrible, that is a negative comparison, and I dislike that Trump matched them, but it isn't disqualifying.) You could actually argue it was his job to make these lawsuits (to see that the federal election was properly executed). Coming up with who the electors would be if the lawsuit changed the results is normal (and not at all new). There was no attempt to go outside the legal system.

In all likelihood there was a nonzero amount of cheating (but we don't actually know if it was favoring Biden; Republicans can cheat too), though I doubt there was any major conspiracy of it. I expect there is always some cheating by both sides, and we should try to reduce it, though I have no opinion on exactly how much or little there is. There were enough anomalies that investigating them would have made sense, if only to prevent them in the future.

The J6 riots were just normal riots, on a small scale, that Trump didn't support at all and were nothing approaching a coup at all. Congress was not in any real danger, and there were n
Pazzaz

I think you are missing something. The lawsuits were fine, though maybe a little silly, as most of them were thrown out for lack of standing. I'm thinking more of the "fake elector plot", where Trump pressured Mike Pence to certify fake electors on Jan 6 (as Pence said: "choose between [Trump] and the constitution"). I think trying to execute that plan was wrong, because if it had succeeded, Trump would have stolen the election.

And Trump may not have supported everything the J6 rioters did, but he was the reason that they were there. He ... (read more)

by Daniel Böttger

This was posted on ACX a few months ago but came up recently in a Bayesian Conspiracy podcast.

Mateusz Bagiński
(I see what podcasts you listen to.)
Nathan Helm-Burger
Does LessWrong need link posts for astralcodexten? Aren't LessWrong readers already pretty aware of Scott's substack? Anyway, if you are reading this one, there are some interesting comments, so check those out too.

Does LessWrong need link posts for astralcodexten? 

Not in general, no. 

Aren't LessWrong readers already pretty aware of Scott's substack?

I would be surprised if the overlap is > 50%

I'm linkposting it because I think this fits into a larger pattern of understanding cognition that will play an important role in AI safety and AI ethics.

Thanks for the comments. You're right that "will not extend your life" is too strong. I revised it to "is unlikely to significantly extend your life." Given the impact of other factors on longevity (strength training: 25%, aerobic exercise: 37%, walking 12k steps: 65%, 20g nuts daily: 15%), I do feel the reduction in all-cause mortality from weight loss shouldn't be the top priority.

Wei Dai

Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.

When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify ... (read more)

Chris_Leong
I would suggest 50% of researchers working on a broader definition of control: including "control", technical governance work and technical outreach (scary demos, model organisms of misalignment). 
jacquesthibs
I'm in the process of trying to build an org focused on "automated/augmented alignment research." As part of that, I've been thinking about which alignment research agendas could be investigated in order to make automated alignment safer and trustworthy. And so, I've been thinking of doing internal research on AI control/security and using that research internally to build parts of the system I intend to build. I figured this would be a useful test case for applying the AI control agenda and iterating on issues we face in implementation, and then sharing those insights with the wider community. Would love to talk to anyone who has thoughts on this or who would introduce me to someone who would fund this kind of work.
TsviBT
Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?

Produced by Jon Kutasov and David Steinberg as a capstone project for ARENA. Epistemic status: 5 days of hacking, and there could be bugs we haven’t caught. Thank you to the TAs that helped us out, and to Adam Karvonen (author of the paper our work was based on) for answering our questions and helping us debug early in the project!

Overview

This paper documents the training and evaluation of a number of SAEs on Othello and ChessGPT. In particular, they train SAEs on layer 6 of 8 in these language models. One of the interesting results they find is that the latents captured by the SAE perfectly or almost perfectly reconstruct several concepts about the state of the game.

ChessGPT is an 8 layer transformer model. It reads in...

Joseph Bloom
Cool work! Have you tried to generate autointerp of the SAE features? I'd be quite excited about a loop that does the following:

* Take an SAE feature and get the max activating examples.
* Use a multi-modal model, maybe Claude, to do autointerp via images of each of the chess positions (might be hard, but with the right prompt it seems doable).
* Based on a codebase that implements chess logic which can be abstracted away (eg: has functions that take a board state and return whether or not statements are true, like "is the king in check?"), get a model to implement a function that matches its interpretation of the feature.
* Use this to generate a labelled dataset on which you then train a linear probe.
* Compare the probe activations to the feature activations. In particular, see whether you can generate a better automatic interpretation of the feature if you prompt with examples of where it differs from the probe.

I suspect this is nicer than language modelling in that you can programmatically generate your data labels from explanations rather than relying on LMs. Of course, you could just decide a priori what probes to train, but the loop between autointerp and the next probe seems cool to me. I predict that current SAE training methods will result in long description lengths or low recall and the tradeoff will be poor.
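
A minimal sketch of the probe half of this loop (illustrative only, not code from the paper or the comment): an autointerp explanation such as "fires when the side to move is in check" becomes a programmatic label via the python-chess library, a linear probe is trained on placeholder activations, and positions where probe and feature disagree are surfaced for the next autointerp round:

```python
# Minimal sketch: explanation -> programmatic labels -> linear probe -> comparison.
# `acts` and `feature_acts` are placeholders for model activations and SAE feature
# activations on the same positions; real data would come from ChessGPT + the SAE.
import chess
import torch
import torch.nn as nn

def label_from_explanation(fen: str) -> float:
    """Label implementing the interpreted concept: the side to move is in check."""
    return float(chess.Board(fen).is_check())

def train_probe(acts: torch.Tensor, labels: torch.Tensor, steps: int = 200) -> nn.Linear:
    probe = nn.Linear(acts.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = nn.functional.binary_cross_entropy_with_logits(
            probe(acts).squeeze(-1), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return probe

# Two example positions: the starting position and a position with a check.
board = chess.Board()
for move in ["e4", "f5", "Qh5+"]:
    board.push_san(move)
fens = [chess.STARTING_FEN, board.fen()]

labels = torch.tensor([label_from_explanation(f) for f in fens])
acts = torch.randn(len(fens), 512)       # placeholder residual activations
feature_acts = torch.rand(len(fens))     # placeholder SAE feature activations

probe = train_probe(acts, labels)
disagreement = (torch.sigmoid(probe(acts).squeeze(-1)) - feature_acts).abs()
# Positions with high disagreement are candidates to feed back into the autointerp prompt.
```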

Thanks for the suggestion! This sounds pretty cool and I think would be worth trying.

One thing that might make this a bit tricky is finding the right subset of the data to feed into Claude. Each feature only fires very rarely so it can be easy to fool yourself into thinking that you found a good classifier when you haven’t.

For example, many of the features we found only fire when they see check. However, many cases of check don’t activate the feature. The problem we ran into is that check is such an infrequent occurrence that you can only get a good number... (read more)
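
A small, entirely made-up illustration of why rare concepts like check make this tricky: on an imbalanced sample, a classifier that almost never fires can look good by accuracy, so precision and recall on the positive class are the numbers to watch:

```python
# Illustrative only: 1000 positions, 2% of which are checks, and a "classifier"
# (e.g. thresholded feature activations) that misses most of them.
import torch

labels = torch.zeros(1000)
labels[:20] = 1.0                      # 2% positive rate ("check"), made-up
preds = torch.zeros(1000)
preds[:5] = 1.0                        # fires on only 5 of the 20 checks
preds[990:] = 1.0                      # plus 10 false positives

accuracy = (preds == labels).float().mean()            # 0.975, looks great
tp = ((preds == 1) & (labels == 1)).sum().float()
precision = tp / (preds == 1).sum()                     # 5 / 15 ~= 0.33
recall = tp / (labels == 1).sum()                       # 5 / 20 = 0.25
print(accuracy.item(), precision.item(), recall.item())
```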

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).

TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors, and uncovering latent capabilities.

Summary

In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given...
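
A minimal sketch of the steering-vector case described in this summary (illustrative, not the author's implementation): a single vector of fixed norm R is added to the MLP output of a source layer and trained to maximize the change in a later layer's activations. It assumes a Llama-style HuggingFace model whose decoder layers sit at model.model.layers[i] with an .mlp submodule; the model name, layer indices, prompt, and R are placeholders:

```python
# Sketch only: unsupervised steering vector trained on a single prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"          # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)                      # only the steering vector is trained

src_layer, tgt_layer, R = 4, 10, 8.0             # assumed hyperparameters
theta = torch.nn.Parameter(torch.randn(model.config.hidden_size) * 0.01)
opt = torch.optim.Adam([theta], lr=1e-2)

ids = tok("One prompt is enough:", return_tensors="pt").input_ids
with torch.no_grad():                            # unsteered reference activations
    baseline = model(ids, output_hidden_states=True).hidden_states[tgt_layer]

def add_steering_vector(module, inputs, output):
    # Add the norm-R steering vector to the MLP output of the source layer.
    return output + R * theta / theta.norm()

handle = model.model.layers[src_layer].mlp.register_forward_hook(add_steering_vector)

for _ in range(100):
    steered = model(ids, output_hidden_states=True).hidden_states[tgt_layer]
    loss = -(steered - baseline).norm()          # maximize downstream change
    opt.zero_grad(); loss.backward(); opt.step()

handle.remove()
# The trained vector (rescaled to norm R) can then be added at generation time
# to see which behavior it elicits.
```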

We attempted 1.a ("diversity measure based on sentence embeddings") and found that for Llama3.2-1B the diversity appears to decay after the cusp value for R; picking R at the highest average diversity was a decent heuristic for finding meaningful steering vectors. The Llama model starts to produce highly repetitive output past the cusp. We demonstrated that repetitive completions were considered similar by our chosen sentence embedding model (SentenceTransformer all-mpnet-base-v2). Using "sum of variances" vs "mean of cosine similarities" didn't seem to matter.
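
A minimal sketch of the two diversity measures mentioned above, using the same sentence-embedding model (all-mpnet-base-v2); the completions are placeholders:

```python
# Sketch: diversity of a set of completions via sentence embeddings.
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")
completions = [
    "The cat sat on the mat.",
    "The cat sat on the mat!",          # near-duplicate -> contributes little diversity
    "Quantum error correction needs redundancy.",
]
emb = torch.tensor(embedder.encode(completions, normalize_embeddings=True))

sum_of_variances = emb.var(dim=0).sum().item()               # higher = more diverse
pairwise_cos = emb @ emb.T                                    # unit-norm, so dot = cosine
off_diag = pairwise_cos[~torch.eye(len(completions), dtype=torch.bool)]
mean_cos_similarity = off_diag.mean().item()                  # lower = more diverse
print(sum_of_variances, mean_cos_similarity)
```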