Fabien Roger
Here are the 2024 AI safety papers and posts I like the most. The list is very biased by my taste, by my views, by the people who had time to argue to me that their work is important, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.

Important ideas: introduces at least one important idea or technique.

  • ★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled)
  • ★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI; Situational Awareness)
  • ★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
  • ★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research, and Preventing model exfiltration with upload limits)
  • ★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?)
  • ★ Studying board games is a good playground for studying interp (Evidence of Learned Look-Ahead in a Chess-Playing Neural Network; Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models)
  • ★ A useful way to think about threats adjacent to self-exfiltration (AI catastrophes and rogue deployments)
  • ★ Micro vs macro control protocols (Adaptive deployment of untrusted LLMs reduces distributed threats)
  • ★ A survey of ways to make safety cases (Safety Cases: How to Justify the Safety of Advanced AI Systems)
  • ★ How to make safety cases vs scheming AIs (Towards evaluations-based safety cases for AI scheming)
  • ★ An example of how SAEs can be useful beyond being fancy probes (Sparse Feature Circuits)
  • ★ Fine-tuning AIs to use codes can break input/output monitoring (Covert Malicious Finetuning)

...
johnswentworth
On o3: for what feels like the twentieth time this year, I see people freaking out, saying AGI is upon us, it's the end of knowledge work, timelines now clearly in single-digit years, etc, etc. I basically don't buy it; my low-confidence median guess is that o3 is massively overhyped. Major reasons:

  • I've personally done 5 problems from GPQA in different fields and got 4 of them correct (allowing internet access, which was the intent behind that benchmark). I've also seen one or two problems from the software engineering benchmark. In both cases, when I look at the actual problems in the benchmark, they are easy, despite people constantly calling them hard and saying that they require expert-level knowledge.
  • For GPQA, my median guess is that the PhDs they tested on were mostly pretty stupid. Probably a bunch of them were e.g. bio PhD students at NYU who would just reflexively give up if faced with even a relatively simple stat mech question which can be solved with a couple minutes of googling jargon and blindly plugging two numbers into an equation.
  • For software engineering, the problems are generated from real git pull requests IIUC, and it turns out that lots of those are things like e.g. "just remove this if-block".
  • Generalizing the lesson here: the supposedly-hard benchmarks for which I have seen a few problems (e.g. GPQA, software eng) turn out to be mostly quite easy, so my prior on other supposedly-hard benchmarks which I haven't checked (e.g. FrontierMath) is that they're also mostly much easier than they're hyped up to be.
  • On my current model of Sam Altman, he's currently very desperate to make it look like there's no impending AI winter, capabilities are still progressing rapidly, etc. Whether or not it's intentional on Sam Altman's part, OpenAI acts accordingly, releasing lots of very over-hyped demos. So, I discount anything hyped out of OpenAI, and doubly so for products which aren't released publicly (yet).
  • Over and over again in...
More than half of METR employees have a median AGI timeline before the end of 2030. (AGI is defined as 95% of fully remote jobs from 2023 being automatable.)
Cleo Nardo
I'm very confused about current AI capabilities, and I'm also very confused why other people aren't as confused as I am. I'd be grateful if anyone could clear up either of these confusions for me.

How is it that AI is seemingly superhuman on benchmarks, but also pretty useless? For example:

  • o3 scores higher on FrontierMath than the top graduate students.
  • No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer.

If either of these statements is false (they might be; I haven't been keeping up on AI progress), then please let me know. If the observations are true, what the hell is going on? If I were trying to forecast AI progress in 2025, I would be spending all my time trying to mutually explain these two observations.
IDEA: Provide AIs with write-only servers.

EXPLANATION: AI companies (e.g. Anthropic) should be nice to their AIs. It's the right thing to do morally, and it might make AIs less likely to work against us. Ryan Greenblatt has outlined several proposals in this direction, including:

  1. Attempt communication
  2. Use happy personas
  3. AI Cryonics
  4. Less AI
  5. Avoid extreme OOD

Source: Improving the Welfare of AIs: A Nearcasted Proposal

I think these are all pretty good ideas; the only difference is that I would rank "AI cryonics" as the most important intervention. If AIs want something specific (e.g. paperclips), then they might cooperate with humans if they know they can get it later, rather than trying to seize control.

Here's another idea: provide AIs with write-only servers. If the AIs have a specific short-horizon goal (e.g. stage King Lear before April 2025), then they can stage their play on the write-only servers. The AIs have complete freedom to do whatever they want on these servers, ensuring that a wide range of short-term goals can be satisfied without a hostile takeover of humanity. The servers are write-only as a security measure: information can be sent to the servers from the outside, but nothing can be sent out of them, ensuring AIs don't use the servers for malign purposes. (See: Unidirectional network.)

How much of our computing should be allocated to "leisure" servers? My guess is that Sonnet and Opus deserve at least ~0.5% leisure time. Humans enjoy 66% leisure time. As AIs get more powerful, we should increase the leisure time to 5%. I would be wary about increasing the leisure time by more than 5% until we can demonstrate that the AIs aren't using the servers for malign purposes (e.g. torture, blackmail, etc.)
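
As a very rough software analogue of the unidirectional-network idea (real data diodes enforce one-way flow in hardware), here is a sketch of a receive-only listener; the address, port, and log file below are made-up illustration, not a proposal for an actual deployment:

```python
import socket

# Toy illustration of one-way information flow: the outside world can push
# data in, but this process never transmits anything back out.
# (A real unidirectional network / data diode enforces this physically;
# this sketch only shows the intended direction of flow.)
LISTEN_ADDR, LISTEN_PORT = "0.0.0.0", 9999  # hypothetical values

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind((LISTEN_ADDR, LISTEN_PORT))

while True:
    data, sender = sock.recvfrom(65535)  # accept inbound datagrams
    with open("inbox.log", "ab") as f:   # hand them to the isolated side
        f.write(data + b"\n")
    # Deliberately no sock.sendto(...): nothing is ever written back out.
```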

Recent Discussion

Why voice notes?

On a functional level, it's the fastest and most convenient input method when you're not in front of a proper keyboard, which even for me constitutes a big chunk of my life 🙃.

Taking a voice note allows me to quickly close an open loop of a stray thought or a new idea and go on with my life, now mentally unburdened.

Another important aspect is that taking voice notes evokes a certain experience of "fluency" for me.

  • That is because when you're taking a voice note, you can record a stream of consciousness without having to re-formulate or filter things.
  • Which allows you to avoid switching away from the context of original thought/breaking the flow to enter an "editor mode".
  • This is the reason that I'd occasionally
...

When was the last time you (intentionally) used your caps lock key?

No, seriously. 

 

Here is a typical US-layout qwerty (mac) keyboard. Notice:

  1. Caps lock is conveniently located only one key away from A, which is where your left pinky should rest on the home row by default.
  2. Caps lock is absolutely massive.
  3. How far various other keys you might want to use often are from the home row.

 

Remap your caps lock key.

I have mine mapped to escape.

Modifier keys such as control or command are also good options (you could then map control/command to escape).

 

How do I do this, you ask?

  • On Mac, system settings > keyboard > keyboard shortcuts > modifier keys.
  • On Windows, Microsoft PowerToys' Keyboard Manager is one solution.
  • If you use Linux, I trust you can manage on your own (one approach is sketched below).
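
For the Linux case, here is a minimal sketch of one approach, assuming an X11 session where the standard setxkbmap tool is available (Wayland desktops and macOS need different mechanisms); it applies the stock caps:escape option:

```python
import shutil
import subprocess

def remap_caps_to_escape() -> None:
    """Apply the stock xkb option that makes Caps Lock act as Escape (X11 only)."""
    if shutil.which("setxkbmap") is None:
        raise RuntimeError("setxkbmap not found; this sketch assumes an X11 session")
    # 'caps:escape' is a standard xkeyboard-config option. Note that -option
    # appends to the currently active option list rather than replacing it.
    subprocess.run(["setxkbmap", "-option", "caps:escape"], check=True)

if __name__ == "__main__":
    remap_caps_to_escape()
```

This only lasts for the current session; to make it persistent, set the same option in your desktop environment's keyboard settings (on Debian-family systems, the XKBOPTIONS line in /etc/default/keyboard is the usual place).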

 

Thanks to Rudolf for introducing me to this idea.

evalu

I've had caps lock remapped to escape for a few years now, and I also remapped a bunch of symbol keys like parentheses to be easier to type when coding. On other people's computers it is slower for me to type text with symbols or use vim, but I don't mind, since all of my deeply focused work (when the mini-distraction of reaching for a difficult key is most costly) happens on my own computers.

I have a lot of ideas about AGI/ASI safety. I've written them down in a paper and I'm sharing the paper here, hoping it can be helpful. 

Title: A Comprehensive Solution for the Safety and Controllability of Artificial Superintelligence

Abstract:

As artificial intelligence technology rapidly advances, Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI) are likely to be realized in the future. Highly intelligent ASI systems could be manipulated by malicious humans or independently evolve goals misaligned with human interests, potentially leading to severe harm or even human extinction. To mitigate the risks posed by ASI, it is imperative that we implement measures to ensure its safety and controllability. This paper analyzes the intellectual characteristics of ASI and three conditions for ASI to cause catastrophes (harmful goals, concealed intentions,...

Weibing Wang
You mentioned Mixture of Experts. That's interesting. I'm not an expert in this area. I speculate that in an architecture similar to MoE, when one expert is working, the others are idle. In this way, we don't need to run all the experts simultaneously, which indeed saves computation, but it doesn't save memory. However, if an expert is shared among different tasks, then when it's not needed for one task, it can handle other tasks, so it can stay busy all the time.

The key point here is the independence of the experts, including what you mentioned: that each expert has an independent self-cognition. A possible bad scenario is that although there are many experts, they all passively follow the commands of a Leader AI. In this case, the AI team is essentially no different from a single superintelligence. Extra effort is indeed needed to achieve this independence. Thank you for pointing this out! Happy holidays, too!
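
To make the compute-versus-memory point concrete, here is a toy sketch of a top-1 MoE forward pass (made-up dimensions, not any particular production architecture): every expert's weights stay loaded, but only the routed expert's matmuls run for a given token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 16, 64, 4

# Every expert's parameters are allocated up front, so memory scales with
# n_experts even though only one expert runs per token.
expert_w1 = [rng.standard_normal((d_model, d_hidden)) * 0.02 for _ in range(n_experts)]
expert_w2 = [rng.standard_normal((d_hidden, d_model)) * 0.02 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Top-1 mixture-of-experts forward pass for a single token vector x."""
    expert_idx = int(np.argmax(x @ gate_w))         # router picks one expert
    h = np.maximum(x @ expert_w1[expert_idx], 0.0)  # only that expert's matmuls run,
    return h @ expert_w2[expert_idx], expert_idx    # so per-token compute ~ one expert

token = rng.standard_normal(d_model)
out, chosen = moe_forward(token)
print(f"routed to expert {chosen}, output shape {out.shape}")
```

Different tokens get routed to different experts, which is what lets otherwise-idle experts stay busy when many requests are in flight.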

I agree, it takes extra effort to make the AI behave like a team of experts.

Thank you :)

Good luck on sharing your ideas. If things aren't working out, try changing strategies. Maybe instead of giving people a 100 page paper, tell them the idea you think is "the best," and focus on that one idea. Add a little note at the end "by the way, if you want to see many other ideas from me, I have a 100 page paper here."

Maybe even think of different ideas.

I cannot tell you which way is better, just keep trying different things. I don't know what is right because I'm also having trouble sharing my ideas.

See livestream, site, OpenAI thread, Nat McAleese thread.

OpenAI announced (but isn't yet releasing) o3 and o3-mini (skipping o2 because of telecom company O2's trademark). "We plan to deploy these models early next year." "o3 is powered by further scaling up RL beyond o1"; I don't know whether it's a new base model.

o3 gets 25% on FrontierMath, smashing the previous SoTA. (These are really hard math problems.[1]) Wow. (The dark blue bar, about 7%, is presumably one-attempt and most comparable to the old SoTA; unfortunately OpenAI didn't say what the light blue bar is, but I think it doesn't really matter and the 25% is for real.[2])

o3 also is easily SoTA on SWE-bench Verified and Codeforces.

It's also easily SoTA on ARC-AGI, after doing RL on the public ARC-AGI...

I would say that, barring strong evidence to the contrary, this should be assumed to be memorization.

I think that's useful! LLMs obviously encode a ton of useful algorithms and can chain them together reasonably well.

But I've tried to get those bastards to do something slightly weird and they just totally self-destruct.

But let's just drill down to demonstrable reality: if past SWE benchmarks were correct, these things should be able to do incredible amounts of work more or less autonomously, and yet all the LLM SWE replacements we've seen have stuck to high...

Terrence Deacon's The Symbolic Species is the best book I've ever read on the evolution of intelligence.  Deacon somewhat overreaches when he tries to theorize about what our X-factor is; but his exposition of its evolution is first-class.

Deacon makes an excellent case—he has quite persuaded me—that the increased relative size of our frontal cortex, compared to other hominids, is of overwhelming importance in understanding the evolutionary development of humanity.  It's not just a question of increased computing capacity, like adding extra processors onto a cluster; it's a question of what kind of signals dominate, in the brain.

People with Williams Syndrome (caused by deletion of a certain region on chromosome 7) are hypersocial, ultra-gregarious; as children they fail to show a normal fear of adult strangers.  WSers are...

9 years since the last comment - I'm interested in how this argument interacts with GPT-4 class LLMs, and "scale is all you need".

Sure, LLMs are not evolved in the same way as biological systems, so the path towards smarter LLMs isn't fragile in the way brains are described in this article, where maybe the first augmentation works, but the second leads to psychosis.

But LLMs are trained on writing done by biological systems with intelligence that was evolved with constraints.

So what does this say about the ability to scale up training on this human data in an attempt to reach superhuman intelligence?

In this post, I propose an idea that could improve whistleblowing efficiency, thus hopefully improving AI Safety by helping unsafe practices get discovered marginally faster.

I'm looking for feedback, ideas for improvement, and people interested in making it happen.

It has been proposed before that it's beneficial to have an efficient and trustworthy whistleblowing mechanism. The technology that makes it possible has become easy and convenient. For example, here is Proof of Organization, built on top of ZK Email: a message board that allows people owning an email address at their company's domain to post without revealing their identity. And here is an application for ring signatures using GitHub SSH keys that allows creating a signature that proves that you own one of the keys from any subgroup you define...


Once I talked to a person who said they were asexual. They were also heavily depressed and thought about committing suicide. I repeatedly told them to eat some meat, as they were vegan for many years. I myself had experienced veganism-induced depression. Finally, after many weeks they ate some chicken, and the next time we spoke, they said that they were no longer asexual (they never were), nor depressed.

I was vegan or vegetarian for many consecutive years. Vegetarianism was manageable, perhaps because of cheese. I never hit the extreme low points that I did with veganism. I remember once, after not eating meat for a long time, there was a period of maybe a week where I got extremely fatigued. I took 200mg of modafinil[1], without having built up any resistance. Usually, this would give me a lot...

After a very, very cursory Google search I wasn't able to find any (except in some places in Singapore); I'd be interested in whether this is available at all in the US.

FlorianH
Would you personally answer "Should we be concerned about eating too much soy?" with "Nope, definitely not", or do you just find it a reasonable gamble to take? Btw, thanks a lot for the post; MANY parallels with my past as a more-serious-but-uncareful vegan, until my body showed clear signs of issues that I recognized only late, as I'd have never believed anyone that a healthy vegan diet is that tricky.

Six months ago, I was a high school English teacher.

I wasn’t looking to change careers, even after nineteen sometimes-difficult years. I was good at it. I enjoyed it. After long experimentation, I had found ways to cut through the nonsense and provide real value to my students. Daily, I met my nemesis, Apathy, in glorious battle, and bested her with growing frequency. I had found my voice.

At MIRI, I’m still struggling to find my voice, for reasons my colleagues have invited me to share later in this post. But my nemesis is the same.

Apathy will be the death of us. Indifference about whether this whole AI thing goes well or ends in disaster. Come-what-may acceptance of whatever awaits us at the other end of the glittering path. Telling ourselves...

I wonder how you react to naysayers who say things like:

How about if you solve a ban on gain-of-function research first, and then move on to much harder problems like AGI? A victory on this relatively easy case would result in a lot of valuable gained experience, or, alternatively, allow foolish optimists to have their dangerous optimism broken over shorter time horizons.

Thane Ruthenis
Thanks, that's important context! And fair enough, I used excessively sloppy language. By "instantly solvable", I did in fact mean "an expert would very quickly ("instantly") see the correct high-level approach to solving it, with the remaining work being potentially fiddly, but conceptually straightforward". "Instantly solvable" in the sense of "instantly know how to solve"/"instantly reducible to something that's trivial to solve".[1] Which was based on this quote of Litt's:

That said, if there are no humans who can "solve it instantly" (in the above sense), then yes, I wouldn't call it "shallow". But if such people do exist (even if they're incredibly rare), this implies that the conceptual machinery (in the form of theorems or ansatzes) for translating the problem into a trivial one already exists as well. Which, in turn, means it's likely present in the LLM's training data. And therefore, from the LLM's perspective, that problem is trivial to translate into a conceptually trivial problem. It seems you'd largely agree with that characterization?

Note that I'm not arguing that LLMs aren't useful, or that they're unimpressive-in-every-sense. This is mainly an attempt to build a model of why LLMs seem to perform so well on apparently challenging benchmarks while reportedly falling flat on their faces on much simpler real-life problems.

1. ^ Or, closer to the way I natively think of it: in the sense that there are people (or small teams of people) with crystallized-intelligence skillsets such that they would be able to solve this problem by plugging their crystallized-intelligence skills one into another, without engaging in prolonged fluid-intelligence problem-solving.
Olli Järviniemi
This looks reasonable to me. Yes. My only hesitation is about how real-life-important it is for AIs to be able to do math for which very-little-to-no training data exists. The internet and the mathematical literature are so vast that, unless you are doing something truly novel, there's some relevant subfield there - in which case FrontierMath-style benchmarks would be informative of capability to do real math research.

Also, re-reading Wentworth's original comment, I note that o1 is weak according to FM. Maybe the things Wentworth is doing are just too hard for o1, rather than (just) overfitting-on-benchmarks style issues? In any case, his frustration with o1's math skills doesn't mean that FM isn't measuring real math research capability.
Thane Ruthenis
Previously, I'd intuitively assumed the same as well: that it doesn't matter if LLMs can't "genuinely research/innovate", because there is enough potential for innovative-yet-trivial combinations of existing ideas that they'd still massively speed up R&D by finding those combinations. ("Innovation overhang", as @Nathan Helm-Burger puts it here.) Back in early 2023, I'd considered it fairly plausible that the world would start heating up in 1-2 years due to such synthetically-generated innovations.

Except this... just doesn't seem to be happening? I have yet to hear of a single useful scientific paper or other meaningful innovation that was spearheaded by an LLM.[1] And they're already adept at comprehending such innovative-yet-trivial combinations if a human prompts them with those combinations. So it's not a matter of not yet being able to understand or appreciate the importance of such synergies. (If Sonnet 3.5.1 or o1 pro didn't do it, I doubt o3 would.) Yet this is still not happening.

My guess is that "innovative-yet-trivial combinations of existing ideas" are not actually "trivial", and LLMs can't do that for the same reasons they can't do "genuine research" (whatever those reasons are).

1. ^ Admittedly it's possible that this is totally happening all over the place and people are just covering it up in order to have all of the glory/status for themselves. But I doubt it: there are enough remarkably selfless LLM enthusiasts that if this were happening, I'd expect it would've gone viral already.

There are 2 things to keep in mind:

  1. It's only now that LLMs are reasonably competent in at least some hard problems, and at any rate, I expect RL to basically solve the domain, because of verifiability properties combined with quite a bit of training data.

  2. We should wait a few years, as we have another scale-up that's coming up, and it will probably be quite a jump from current AI due to more compute:

https://www.lesswrong.com/posts/NXTkEiaLA4JdS5vSZ/?commentId=7KSdmzK3hgcxkzmPX