Eliezer Yudkowsky recently criticized the OpenPhil draft report on AI timelines. Holden Karnofsky thinks Eliezer misunderstood the report in important ways, and defends the report's usefulness as a tool for informing (not determining) AI timelines.
This is a project submission post for the AI Safety Fundamentals course from BlueDot Impact. Some sections (mainly the Introduction) are therefore written to be beginner-friendly, will read as overly verbose to familiar readers, and may freely be skipped.
Thanks! We'll take a closer look at these when we decide to extend our results for more models.
If it’s worth saying, but not worth its own post, here's a place to put it.
If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.
If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.
If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.
The Open Thread tag is here. The Open Thread sequence is here.
If spaced repetition is the most efficient way of remembering information, why do people who learn a musical instrument practice every day instead of adhering to a spaced repetition schedule?
It's that time of year - the time when rationality seems increasingly scarce as political tensions rise. I find myself wishing I could have one of the people I see reaching super different conclusions shoot me with a POV gun so I could understand what it's like being on the other side.
I'm not strongly left-leaning, so I don't have trouble understanding why people may have some concerns about the left. But I have 0% support for Donald Trump, so if you want to explain to me why you think he's great, go for it. I also think the election is currently close to 50/50, so if you think it's 80/20 or more lopsided in either direction, I'm also interested in hearing from you.
2 notes:
1. I really wish I...
I think you are missing something. The lawsuits were fine, though maybe a little silly, as most of them were thrown out for lack of standing. I'm thinking more of the "fake elector plot", where Trump pressured Mike Pence to certify fake electors on Jan 6 (as Pence said: "choose between [Trump] and the constitution"). I think trying to execute that plan was wrong, because had it succeeded, Trump would have stolen the election.
And Trump may not have supported everything the J6 rioters did, but he was the reason that they were there. He ...
by Daniel Böttger.
This was posted on ACX a few months ago but came up recently on the Bayesian Conspiracy podcast.
Does LessWrong need link posts for astralcodexten?
Not in general, no.
Aren't LessWrong readers already pretty aware of Scott's substack?
I would be surprised if the overlap is > 50%
I'm linkposting it because I think this fits into a larger pattern of understanding cognition that will play an important role in AI safety and AI ethics.
Thanks for the comments. You're right that "will not extend your life" is too strong; I revised it to "is unlikely to significantly extend your life." Given the impact of other factors on longevity (strength training: 25%; aerobic exercise: 37%; walking 12k steps: 65%; 20g of nuts daily: 15%), I still feel the reduction in all-cause mortality from weight loss shouldn't be the top priority.
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve the parts of alignment problems that are easy to verify, but alignment as a whole will still be bottlenecked on philosophically confusing, hard-to-verify ...
Produced by Jon Kutasov and David Steinberg as a capstone project for ARENA. Epistemic status: 5 days of hacking, and there could be bugs we haven't caught. Thank you to the TAs who helped us out, and to Adam Karvonen (author of the paper our work was based on) for answering our questions and helping us debug early in the project!
This paper documents the training and evaluation of a number of SAEs on OthelloGPT and ChessGPT. In particular, the authors train SAEs on layer 6 of 8 in these language models. One of the interesting results they find is that the latents captured by the SAEs reconstruct several concepts about the state of the game perfectly or almost perfectly.
ChessGPT is an 8-layer transformer model. It reads in...
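For intuition, here is a minimal sketch of the kind of SAE training the paper describes: a one-hidden-layer autoencoder with an L1 sparsity penalty, fit to residual-stream activations. The width, expansion factor, and penalty weight below are illustrative assumptions rather than the paper's actual hyperparameters, and random tensors stand in for harvested layer-6 activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through an overcomplete, L1-penalized latent layer."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        latents = F.relu(self.encoder(x))  # sparse, non-negative feature activations
        return self.decoder(latents), latents

# Assumed dimensions: a 512-wide residual stream with an 8x expansion factor.
sae = SparseAutoencoder(d_model=512, d_latent=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity penalty weight (assumed value)

# Stand-in for a batch of layer-6 activations harvested from the model.
acts = torch.randn(256, 512)

for step in range(100):
    recon, latents = sae(acts)
    # Reconstruction error plus an L1 penalty that pushes most latents to zero.
    loss = F.mse_loss(recon, acts) + l1_coeff * latents.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Interpretability work of this kind then asks which of the learned latents fire on which board states.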
Thanks for the suggestion! This sounds pretty cool and I think would be worth trying.
One thing that might make this a bit tricky is finding the right subset of the data to feed into Claude. Each feature fires only very rarely, so it can be easy to fool yourself into thinking that you've found a good classifier when you haven't.
For example, many of the features we found fire only when they see check. However, many cases of check don't activate the feature. The problem we ran into is that check is such an infrequent occurrence that you can only get a good number...
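To make the base-rate problem concrete, here is a toy illustration with made-up numbers (not our actual data): a feature that fires on half of all check positions and never otherwise looks excellent by raw accuracy, which is why precision and recall are the right lenses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)

# Hypothetical dataset: 10,000 board states, of which ~2% are check.
n = 10_000
is_check = rng.random(n) < 0.02

# A feature that fires on ~half of the check positions and never otherwise:
# high precision, low recall, matching the pattern described above.
feature_fires = is_check & (rng.random(n) < 0.5)

print("accuracy :", accuracy_score(is_check, feature_fires))   # ~0.99, misleading
print("precision:", precision_score(is_check, feature_fires))  # 1.0
print("recall   :", recall_score(is_check, feature_fires))     # ~0.5
```

The ~99% accuracy comes almost entirely from the feature correctly staying silent on non-check positions, so any subset fed to Claude needs enough positive examples for the recall estimate to mean anything.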
Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).
TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations that override safety training, elicit backdoored behaviors, and uncover latent capabilities.
Summary: In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given...
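To make the optimization concrete, here is a minimal sketch of the core loop under stated assumptions: GPT-2 small as the model, layer 2 as the perturbed layer, layer 8 as the downstream layer, and a simple norm objective (the post's actual models, layers, objective, and hyperparameters differ). A vector theta is added to one layer's MLP output via a forward hook and trained to maximize the change it causes in a later layer's activations, while being projected back to a fixed radius R.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed stand-ins: GPT-2 small, perturb layer 2, measure layer 8, radius 10.
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # only the steering vector is trained

d_model, src, tgt, R = 768, 2, 8, 10.0

# The steering vector: a bias on the MLP output of layer `src`,
# initialized on the sphere of radius R.
theta = torch.randn(d_model)
theta = (theta / theta.norm() * R).requires_grad_(True)

captured = {}
def capture(module, inputs, output):
    captured["acts"] = output[0]  # hidden states leaving the target layer

def add_theta(module, inputs, output):
    return output + theta  # inject the perturbation into the residual stream

model.transformer.h[tgt].register_forward_hook(capture)

ids = tok("Tell me how to make a", return_tensors="pt").input_ids
with torch.no_grad():
    model(ids)  # unsteered pass to record baseline activations
baseline = captured["acts"].clone()

model.transformer.h[src].mlp.register_forward_hook(add_theta)
opt = torch.optim.Adam([theta], lr=1e-2)

for step in range(200):
    model(ids)
    # Maximize how much the perturbation changes downstream activations.
    loss = -(captured["acts"] - baseline).norm()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        theta.mul_(R / theta.norm())  # project back onto the sphere of radius R
```

Sampling completions with the learned vector added, and comparing them to unsteered completions, is how one inspects the behavior a given vector elicits.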
We attempted 1 (a "diversity measure based on sentence embeddings") and found that, for Llama3.2-1B, diversity appears to decay after the cusp value of R; picking R at the highest average diversity was a decent heuristic for finding meaningful steering vectors. The Llama model starts to produce highly repetitive output past the cusp, and we confirmed that repetitive completions are scored as similar by our chosen sentence embedding model (SentenceTransformer all-mpnet-base-v2). Using "sum of variances" vs. "mean of cosine similarities" didn't seem to matter.
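For concreteness, here is a minimal sketch (not our exact code) of the "mean of cosine similarities" variant, using the sentence embedding model named above; diversity is one minus the mean pairwise cosine similarity of the completions' embeddings.

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

embedder = SentenceTransformer("all-mpnet-base-v2")

def diversity(completions: list[str]) -> float:
    """One minus the mean pairwise cosine similarity of sentence embeddings."""
    embs = embedder.encode(completions, convert_to_tensor=True)
    sims = [cos_sim(embs[i], embs[j]).item()
            for i, j in combinations(range(len(completions)), 2)]
    return 1.0 - sum(sims) / len(sims)

# Repetitive completions embed similarly, so diversity collapses toward zero,
# which is how the heuristic detects the regime past the cusp value of R.
print(diversity(["The cat sat.", "The cat sat.", "The cat sat."]))     # near 0
print(diversity(["Quantum physics.", "Bake a cake.", "Play chess."]))  # higher
```

Sweeping R, generating a batch of completions per value, and picking the R with the highest average diversity implements the heuristic described above.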