Recently, I've been mulling over the question of whether it was a good idea or not to join a frontier AI company's safety team for the purposes of reducing extinction risk. One of my big cons was something like:
Jay, you think the incentives are less likely to affect you compared to most people. But most AI safety people who join frontier labs probably think this. You will be affected as well.
So I decided on a partial mitigation strategy, entirely as a precautionary principle and not at all because I thought I needed to. I committed to myself and to several people I'm close to that if I were to join a frontier lab safety team, I would donate 100% of the surplus that I would gain as a result of taking that job instead of a less lucrative job somewhere else.
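To make the arithmetic of "surplus" concrete (with entirely made-up numbers - I had no offer in hand, so these aren't figures from any real job):

$$\text{surplus} = \text{frontier-lab compensation} - \text{compensation at the counterfactual job}, \qquad \text{e.g. } \$450\text{k} - \$120\text{k} = \$330\text{k donated per year.}$$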
At the time, I was applying for a few jobs, one of which was at a frontier company. Approximately immediately, my System 1 became way less interested in that job. And I didn't even have an offer in hand for a specific amount of money. I don't have good reasons to care a lot about getting more money for myself: I have enough already, and I voluntarily live well below my means. That did not stop the effect from existing, and I hadn't noticed it before. I still don't notice the effect on my thinking in a vacuum; I only notice it by doing a mental side-by-side comparison.
I now think anyone who is considering joining a frontier company in order to reduce extinction risk should make this same commitment as a basic defensive measure against perverse incentives. I am sure there exist people who are entirely indifferent to money in this way - this is at least partially a skill issue on my part. But it does seem that "thinking you are indifferent to the money" is not a reliable signal that your thinking is unaltered by it.
This is also an opportunity to say that, if I ever do join a frontier safety team, I officially give you permission to ask me if I'm meeting this commitment of mine in conversation, even if other people are around.
This is highly admirable and commendable. I am wondering if the EA/AI safety community in general could encourage this norm.
Thanks for reminding me to crosspost this to the EA Forum as well, I had totally forgotten to do that.
Thanks for making a strong effort to track incentives and their effects on you, especially when dealing with a topic like this. If I had to guess, many more highly visible/prominent members of the community haven't done the same.
I wonder how much the aversion comes from reckoning with the more visceral thought of having to spend a ton of money on donations, rather than the more abstract thought of "getting no surplus money". Spending money on donations is often stressful and/or can feel like losing a lot of money! (cf loss aversion)
This came at the right time, as I am actively looking into joining a frontier AI company safety team. Although the incentives these AI companies offer seem really encouraging, I have noticed that I feel indifferent when it comes to financial gains. What excites me most, and makes me want to actually work in AI safety, is the contribution - the constant rush of knowing that you may be closer to understanding how these systems work.
I noticed I did not care about who was offering what when I found myself pitching to someone to do volunteer work for them - all I wanted was to get the resources and be in that environment.
I am worried I am obsessing over this field.
Could you explain which AI companies[1] you considered, and how joining such a company could increase x-risk compared to the options where you are not on a frontier lab safety team? My baseline extinction scenario is that OAI/Anthropic/GDM/a union thereof creates an unaligned AI which escapes before deployment, escapes after deployment, or is trusted to create a successor. The safety team's role is to develop interpretability techniques and to find evidence of misalignment before the AI becomes capable of harming mankind. I don't see a way for you to decrease P(finding evidence) unless your potential rival was more competent. Alas, I also struggle to understand how one can estimate P(rival was more competent).
Unless, of course, they are xAI or Meta which I'd rather see destroyed.
The specific company I was considering in this example was GDM, but I see Anthropic as broadly similar and OAI as somewhat worse. That's why I didn't name it - there was nothing in my logic that was specific to GDM.
If I put on my pessimist hat, here are the ways in which this could counterfactually increase extinction risk:
By contrast, the counterfactual thing I could do is other potentially useful work that might reduce extinction risk more. My alternate path, which is the one I currently estimate I'm very likely to take (just waiting on security clearance), is to join the new Australian AISI and try to shape their research direction in a way that helps with extinction risk. The expected value of this is well above zero in my book, so it's not enough just to avoid increasing extinction risk at a frontier lab. I also have to decrease extinction risk more there than anywhere else I could otherwise be working.
And since that's really hard to calculate, it's easy to wind up doing the things that follow money and status gradients as long as they have a plausible path to impact. Committing to giving away any extra money is a good way of taking away one of those incentives. The money could do a lot of good if donated, but that's an abstract good, it doesn't emotionally move me in the way that a large salary clearly does. I don't endorse this and I am not proud of this, but it would be far worse if I denied it to myself.
To state the obvious (but that I don't think you stated explicitly so far): If your judgment is that there are money-bottlenecked initiatives that you could un-bottleneck with an amount like $[GDM AI safety minus Oz AISI] (and that likely will be un-bottlenecked if you don't do this), and your estimated counterfactual impact is comparable in both cases, this might tip the decision strongly in favor of GDM AI safety.
I recently read Anthropic's new paper on introspection in LLMs. (https://www.anthropic.com/research/introspection)
In short, they were able to inject a concept into the model's activations (essentially a steering vector) and then, when the model was prompted to introspect, have it notice that an injected "thought" was present and identify the concept, at least some of the time.
They did this with a bunch of concepts. And Opus 4 was, of course, better than other models at this.
To give some fairly obvious thoughts on what this means:
If a model can unreliably do this now when directly prompted to do so, I expect the model can reliably do this without being prompted in around 2-3 years from now.
If you have an alignment plan that involves altering the model's thoughts in some way, consider that the model will probably be aware by default that you did that, and plan accordingly.
Finally a more personal thought:
The idea of "Well, if you try to adjust a superintelligence's thoughts, it will know you did that and try to subvert it / route around the imposed thoughts" would have sounded weird to me if I'd tried saying it out loud. We're talking about literally altering the model's mind to think what we want it to think, and you still think that's not enough for alignment? I note that I did think that, but I was thinking in terms of "The concepts you are altering are a proxy for the thing you are actually trying to alter, and the important performance-enhancing stuff like deception will migrate to the other parts" and not "The model will detect you altering its mind and will likely actively work against you once it builds the situational awareness for that". And yet here we are, with me insufficiently paranoid. Unfortunately, reality doesn't care about what sounds intuitively plausible.
I wonder if this is an example that would allow people to feel, viscerally, the difficulty of aligning an actually smart entity. Here we have a capability that sounds scary and superhuman and only arose with smart models. The idea of a model being aware of when you're altering its thoughts is terrifying. The lesson we would like to give to decision-makers is something like:
"AI's will soon know when you try to adjust its thoughts. That's how smart these things are. We can employ a primitive type of mind control on them and that still isn't enough. We are really not ready for even smarter systems than this. Any alignment plan that might work has to be stronger than mind control. Anything below the level of literal mind control isn't even the ante to sit down at the table any more."
I can't literally reduce this down to five words, but a succinct message might be something like "Mind control won't work. We need a plan that's better than that."
OK, so here we should give some credit to Eliezer for having thought this through and who has been making a big deal about this precise thing (although partly through his recent fanfic that he keeps alluding to, so maybe that doesn't really count).
I tend to find him a bit annoying on these things because surely people wouldn't seriously try to pin substantial hopes on these kind of ideas, but what do I know.
Thank you for pointing this out. I am deducting some points from myself for not realizing this on my own. On a more mundane level: steering vectors might get somewhat screwy rather soon.
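For anyone who hasn't bumped into steering vectors before, here's a minimal sketch of the kind of intervention being discussed - purely illustrative, assuming a generic PyTorch model; the toy model, layer choice, and scale factor are all made-up stand-ins, not anything from the paper:

```python
# Minimal, illustrative sketch of "adding a difference vector" (a steering vector)
# to a model's activations. Everything here is a toy stand-in.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for one transformer layer."""
    def __init__(self, d_model=16):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = nn.Sequential(TinyBlock(), TinyBlock(), TinyBlock())

# A steering vector is typically the difference between mean activations on
# prompts that evoke a concept and mean activations on neutral prompts.
acts_concept = torch.randn(32, 16)   # placeholder activations on "concept" prompts
acts_neutral = torch.randn(32, 16)   # placeholder activations on neutral prompts
steering_vector = acts_concept.mean(dim=0) - acts_neutral.mean(dim=0)

def steer(module, inputs, output, alpha=4.0):
    # Add the scaled difference vector to this layer's output at inference time.
    return output + alpha * steering_vector

# Inject at a middle layer; everything downstream now computes on the nudged
# activations - which is exactly the sort of edit a model might learn to notice.
handle = model[1].register_forward_hook(steer)
out = model(torch.randn(4, 16))
handle.remove()
```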
Could it be limited to stuff like LLMs rather than all kinds of AI? They were trained with massive amounts of data that don't reflect the imposed thought, and the resulting preferences/motivations are distributed across a large network. Injecting one vector doesn't sufficiently affect all the existing circuits of preferences and thinking habits, so the model's chain of thought may be able to break free enough to notice and work around it.
One would hope that progress in interpretability will allow us to do much more refined "mind control" than adding difference vectors.
One would hope indeed. But even so, we do now know that this is likely to be the kind of action that could be detected and opposed. And since I didn't predict in advance that this would happen, especially at this capability level, the update for me is that it's going to be significantly harder than I would hope.
Speedrunners have a tendency to totally break video games in half, sometimes in the strangest and most bizarre ways possible. I feel like some of the more convoluted video game speedrun / challenge run glitches out there are actually a good way to build intuition on what high optimisation pressure (like that imposed by a relatively weak AGI) might look like, even at regular human or slightly superhuman levels. (Slightly superhuman being a group of smart people achieving what no single human could)
Two that I recommend:
https://www.youtube.com/watch?v=kpk2tdsPh0A - Tool-assisted run where the inputs are programmed frame by frame by a human and executed by a computer. Exploits idiosyncrasies in Super Mario 64's code that no human could ever use unassisted in order to reduce the number of times the A button needs to be pressed in a run. I wouldn't be surprised if this guy knows more about SM64's code than the devs at this point.
https://www.youtube.com/watch?v=THtbjPQFVZI - A glitch using outside-the-game hardware considerations to improve consistency on yet another crazy in-game glitch. Also showcases just how large the attack space is.
These videos are also just incredibly entertaining in their own right, and not ridiculously long, so I hypothesise that they're a great resource to send to more skeptical people if they understand the idea of AGI but are systematically underestimating the difference between "bug-free" (the program will not have bugs during normal operation) and "secure" (the program will not have bugs when deliberately pushed towards narrow states designed to create bugs).
For a more serious overview, you could probably find obscure hardware glitches and such to achieve the same lesson.
I'm not sure I agree that it's a useful intuition pump for the ways an AGI can surprisingly-optimize things. They're amusing, but fundamentally based on out-of-game knowledge about the structure of the game. Unless you're positing a simulation hypothesis, AND that AGI somehow escapes the simulation, it's not really analogous.
They're amusing, but fundamentally based on out-of-game knowledge about the structure of the game.
Evolutionary and DRL methods are famous for finding exploits and glitches model-free, from within the game. Chess endgame databases are another example.
A frame that I use that a lot of people I speak to seem to find A) Interesting and B) Novel is that of "idiot units".
An Idiot Unit is the length of time it takes before you think your past self was an idiot. This is pretty subjective, of course, and you'll need to decide what that means for yourself. Roughly, I consider my past self to be an idiot if they have substantially different aims or are significantly less effective at achieving them. Personally my idiot unit is about two years - I can pretty reliably look back in time and think that compared to year T, Jay at year T-2 had worse priorities or was a lot less effective at pursuing his goals somehow.
Not everyone has an Idiot Unit. Some people believe they were smarter ten years ago, or haven't really changed their methods and priorities in a while. Take a minute and think about what your Idiot Unit might be, if any.
Now, if you have an Idiot Unit for your own life, what does that imply?
Firstly, hill-climbing heuristics should be upweighted compared to long-term plans. If your Idiot Unit is U, any plan that takes more than U time means that, after U time, you're following a plan that was designed by an idiot. Act accordingly.
That said, a recent addition I have made to this - you should still make long-term plans. It's important to know which of your plans are stable under Idiot Units, and you only get that by making those plans. I don't disagree with my past self about everything. For instance, I got a vasectomy at 29, because not wanting kids had been stable for me for at least ten years, so I don't expect more Idiot Units to change this.
Secondly, if you must act according to long-term plans (a college/university degree takes longer than U for me, especially since U tends to be shorter when you're younger), try to pick plans that preserve or increase optionality. I want to give Future Jay as many options as possible, because he's smarter than me. When Past Jay decided to get a CS degree, he had no idea about EA or AI alignment. But a CS degree is a very flexible investment, so when I decided to do something good for the world, I had a ready-made asset to use.
Thirdly, longer-term investments in yourself (provided they aren't too specific) are good. Your impact will be larger a few years down the track, since you'll be smarter then. Try asking what a smarter version of you would likely find useful and seek to acquire that. Resources like health, money, and broadly-applicable knowledge are good!
Fourthly, the lower your Idiot Unit is, the better. It means you're learning! Try to preserve it over time - Idiot Units naturally grow with age so if yours stands still, you're actually making progress.
I'm not sure if it's worth writing up a whole post on this with more detail and examples, but I thought I'd get the idea out there in Shortform.
I like the idea, but you can also get fake Idiot Units for moving in a random direction that is not necessarily an improvement.
I bought a month of Deep Research and am open to running queries if people have a few but don't want to spend 200 bucks for them. Will spend up to 25 queries in total.
A paragraph or two of detail is good - you can send me supporting documents via wnlonvyrlpf@tznvy.pbz (ROT13) if you want. Offer is open publicly or via PM.
How likely is it that AI will surpass humans, take over all power, and cause human extinction some time during the 21st century?
How about "analyze the implications for risk of AI extinction, based on how OpenAI's safety page has changed over time"? Inspired by this comment (+ follow-up w/ Internet archive link).
Thanks! I haven't had time to skim either report yet, but I thought it might be instructive to ask the same questions to Perplexity Pro's DR mode (it uses the same "Deep Research" name as OpenAI's but otherwise has no relation to it; in particular, it's free with their regular monthly subscription, so it can't be particularly powerful): see here. (This is the second report I generated, as the first one froze indefinitely while the app generated the conclusion, and thus the report couldn't be shared.)
There's a counterargument to the AGI hype that basically says: of course the labs would want to hype this technology, they make money that way; just because they say they believe in short timelines doesn't mean it's true. Specifically, the claim here is not that the AI lab CEOs are mistaken, but rather that they are actively lying, and they know AGI isn't around the corner.
What actions have frontier AI labs taken in the last year or two that wouldn't make sense, given the above explanation? Stuff like GDM's merger or OpenAI (reportedly) operating at a massive loss. Ideally these actions would be reported on by entities other than those companies themselves, in order to help convince skeptics. I've definitely seen stuff like this around but I can't remember where and the search terms are too vague.
I've also tried using Deep Research for this, but it doesn't seem to understand the idea of only looking at actions that are far more likely in the non-hype world than the hype-world, talking about things like investor decks projecting high returns that are very compatible with them hyping themselves up.
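To state the filter I'm after a bit more explicitly (this is just a rephrasing of the above, not a new claim): I want actions with a large likelihood ratio between the two worlds,

$$\frac{P(\text{action} \mid \text{labs genuinely expect AGI soon})}{P(\text{action} \mid \text{labs are knowingly hyping})} \gg 1.$$

Investor decks projecting high returns have a ratio near 1 - they're about equally expected in both worlds - which is why they don't move me.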
I think the actions of GDM, Meta, and OpenAI are all extremely consistent with thinking that AI will be a very economically valuable technology, but that we won't see AGI within a few years.
One of the core problems of AI alignment is that we don't know how to reliably get goals into the AI - there are many possible goals that are sufficiently correlated with doing well on training data that the AI could wind up optimising for a whole bunch of different things.
Instrumental convergence claims that a wide variety of goals will lead to convergent subgoals such that the agent will end up wanting to seek power, acquire resources, avoid death, etc.
These claims do seem a bit...contradictory. If goals are really that inscrutable, why do we strongly expect instrumental convergence? Why won't we get some weird thing that happens to correlate with "don't die, keep your options open" on the training data, but falls apart out of distribution?