I agree with this general intuition, thanks for sharing.
I'd value descriptions of the specific failures you'd expect from an LLM that we tried to RLHF against "bad instrumental convergence" but where we failed, or a better sense of how you'd guess that would look on an LLM agent or a scaled GPT.
I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).
I see. I guess where we might disagree is that, in my view, a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.
...I think it's far from obvious that an AI company n
Thanks for the clarifications.
But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?
1. I think we agree on the fact that "unless it's provably safe" is the best version of trying to get a policy slowdown.
2. I believe there are many interventions that could help on the slowdown side, most of which are...
So I guess first you condition on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind?
Thanks for writing that up.
I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single thing could help massively". And this core consideration is also why I don't think that the "Successful, careful AI lab" is right.
Staying at the frontier of capabilities and deploying makes the frontrunner feel the heat, which accelerates both capabilities and the chances of careless deployment, which in turn substantially increases the chances of extinction.
Extremely excited to see this new funder.
I'm pretty confident that we can indeed find a significant number of new donors for AI safety since the recent Overton window shift.
Chatting with people with substantial networks, it seemed to me like a centralized non-profit fundraising effort could probably raise at least $10M. Happy to intro you to those people if relevant @habryka.
And reducing the processing time is also very exciting.
So thanks for launching this.
Thanks for writing this.
Overall, I don't like the post much in its current form. There's ~0 evidence (e.g. from Chinese newspapers) and there is very little actual argumentation. I like that you give us a local view, but putting a few links to back your claims would be very much appreciated. Right now it's hard to update on your post, given that the claims are very empirical and come without any external sources.
A more minor point: I also disagree with the statement that "A domestic regulation framework for nuclear power is not a strong signal for a willingness to engage in nuclear arms reduction". I think it's definitely a signal.
@beren in this post, we find that our method (Causal Direction Extraction) allows us to capture a lot of the gender difference with 2 dimensions in a linearly separable way. Skimming that post might be of interest to you and your hypothesis.
In the same post though, we suggest that it's unclear how much the logit lens "works", to the extent that the direction that best encodes a given concept likely changes by a small angle at each layer, which causes two directions that best encode the same concept 15 layers apart to have a cosine similarity <0.5...
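To illustrate how small per-layer drifts compound, here's a toy sketch; the ~4° per-layer angle and the planar rotation are purely hypothetical illustrations, not measurements from the post:

```python
import numpy as np

def rotate(v, theta):
    """Rotate a 2-D unit vector by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ v

theta = np.deg2rad(4.0)          # hypothetical per-layer drift of the concept direction
v_early = np.array([1.0, 0.0])   # direction that best encodes the concept at some layer
v_late = v_early.copy()
for _ in range(15):              # accumulate the drift over 15 layers
    v_late = rotate(v_late, theta)

# Cosine similarity between the two directions: cos(15 * 4°) = cos(60°) = 0.5
print(v_early @ v_late)
```

So a drift of only ~4° per layer is already enough to bring the 15-layer cosine similarity down to 0.5.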
I'd add that it's not an argument to make models agentic in the wild. It's just an argument to be already worried.
Thanks for writing that up Charbel & Gabin. Below are some elements I want to add.
In the last 2 months, I spent more than 20h with David talking and interacting with his ideas and plans, especially in technical contexts.
As I spent more time with David, I got extremely impressed by the breadth and the depth of his knowledge. David has cached answers to a surprisingly high number of technically detailed questions on his agenda, which suggests that he has pre-computed a lot of things regarding his agenda (even though it sometimes looks very weird on ...
I'll focus on 2 first, given that it's the most important. 2. I would expect sim2real to not be too hard for foundation models because they're trained over massive distributions, which allow and force them to generalize to near neighbours. E.g. I think that it wouldn't be too hard for an LLM to generalize some knowledge from stories to real life if it had an external memory, for instance. I'm not certain, but I feel like robotics is more sensitive to details than plans are (which is why I'm mentioning a simulation here). Finally, regarding long horizons, I agree that it s...
Yes, I definitely think that countries with strong deontological norms will try harder to solve some narrow versions of alignment than those that tolerate failures.
I think it's quite reassuring, and it means that it's reasonable to focus a lot on the US in our governance approaches.
I think it's misleading to state it that way. There were definitely dinners and discussions with people around the creation of OpenAI.
https://timelines.issarice.com/wiki/Timeline_of_OpenAI
Months before the creation of OpenAI, there was a discussion about starting OpenAI that included Chris Olah, Paul Christiano, and Dario Amodei: "Sam Altman sets up a dinner in Menlo Park, California to talk about starting an organization to do AI research. Attendees include Greg Brockman, Dario Amodei, Chris Olah, Paul Christiano, Ilya Sutskever, and E...
Also, I think that it's fine to have lower chances of being an excellent alignment researcher for that reason. What matters is having impact, not being an excellent alignment researcher. E.g. I don't go all-in on a technical career myself essentially for that reason, combined with the fact that I have other features that might allow me to go further in the impact tail in other subareas that are relevant.
If I try to think about someone's IQ (which I don't normally do, except for the sake of the message above, where I tried to think of a specific number to make my claim precise), I feel like I can form an ordering where I'm not too uncertain, on a scale that includes me, some common reference classes (e.g. the median student of school X has IQ Y), and a few people around me who did IQ tests. By the way, I'd be happy to bet on anyone (e.g. from the list of SERI MATS's mentors) who agreed to reveal their IQ, if you think my claim is wrong.
Thanks for writing that.
Three thoughts that come to mind:
I think that yes it is reasonable to say that GPT-3 is obsolete.
Also, you mentioned loads of AGI startups being created in 2023, while that already happened a lot in 2022. How many more AGI startups do you expect in 2023?
But I don't expect these kinds of understanding to transfer well to understanding Transformers in general, so I'm not sure it's high priority.
The point is not necessarily to improve our understanding of Transformers in general, but that if we're pessimistic about interpretability on dense transformers (like markets are, see below), we might be better off speeding up capabilities on architectures we think are a lot more interpretable.
The idea that EVERY government is dumb and won't figure out a not-too-bad way to allocate its resources towards AGI seems highly unlikely to me. There seem to be many mechanisms by which this could fail to be the case (e.g. national defense is highly involved and is a bit more competent, the strategy is designed in collaboration with some competent people from the private sector, etc.).
To be more precise, I'd be surprised if none of these 7 countries had an ambitious plan that meaningfully changed the strategic landscape post-2030:
I guess I'm a bit less optimistic on the ability of governments to allocate funds efficiently, but I'm not very confident in that.
A fairly dumb-but-efficient strategy that I'd expect some governments to take is "give more money to SOTA orgs" or "give some core roles to SOTA orgs in your Manhattan Project". That seems likely to me and that would have substantial effects.
Unfortunately, good compute governance takes time. E.g., if we want to implement hardware-based safety mechanisms, we first have to develop them, convince governments to implement them, and then they have to be put on the latest chips, which take several years to dominate compute.
This is a very interesting point.
I think that some "good compute governance" such as monitoring big training runs doesn't require on-chip mechanisms but I agree that for any measure that would involve substantial hardware modifications, it would probably take a lot of ...
What I'm confident in is that they're more likely to be ahead by then than they are now or will be within a couple of years. As I said, otherwise my confidence is ~35% that China catches up (or pulls ahead) by 2035, which is not huge?
My reasoning is that they've been better than the US at optimizing ~everything, mostly because of their centralization and norms (not caring too much about human rights helps with optimization), which is why I think it's likely that they'll catch up.
Mostly because they have a lot of resources and thus can weigh a lot in the race once they enter it.
Thanks for your comment!
I see your point about fear spreading causing governments to regulate. I basically agree that if that's what happens, it's good to be in a position to shape the regulation in a positive way, or at least to try to. Still, I'm more optimistic about corporate governance, which seems more tractable to me than policy governance.
The points you make are good, especially in the second paragraph. My model is that if scale is all you need, then it's likely that indeed smaller startups are also worrying. I also think that there could be visible events in the future that would make some of these startups very serious contenders (happy to DM about that).
Having a clear map of who works in corporate governance and who works more towards policy would be very helpful. Is there anything like a "map/post of who does what in AI governance" or anything like that?
Have you read note 2? If note 2 were made more visible, would you still think that my claims imply too high a certainty?
To be honest, I hesitated to decrease the likelihood on that one based on your consideration, but I still think that a 30% chance of strong effects is quite a lot, because, as you mentioned, it requires the intersection of many conditions.
In particular, you don't mention which interventions you expect from them. If you take the intervention I used as a reference class ("Constrain labs to airgap and box their SOTA models while they train them"), do you think there are measures as "extreme" as this one, or more so, that are likely?
What might ...
Thanks for your comment!
First, you have to keep in mind that when people talk about "AI" in industry and policymaking, they usually have mostly non-deep-learning or vision deep-learning techniques in mind, simply because they mostly don't know the academic ML field but have heard that "AI" is becoming important in industry. So this sentence is little evidence that Russia (or any other country) is trying to build AGI, and I'm at ~60% that Putin wasn't thinking about AGI when he said that.
...If anyone who could play any role at all in develop
[Cross-posting my answer]
Thanks for your comment!
That's an important point that you're bringing up.
My sense is that at the movement level, the consideration you bring up is super important. Indeed, even though I have fairly short timelines, I would like funders to hedge for long timelines (e.g. fund stuff for China AI Safety). Thus I think that big actors should have in mind their full distribution to optimize their resource allocation.
That said, despite that, I have two disagreements:
To get a better sense of people's standards on "cut at the hard core of alignment", I'd be curious to hear examples of work that has done so.
It would be worth paying someone to do this in a centralized way:
If someone is interested in doing this, reach out to me (campos.simeon @gmail.com)
Do you think we could use grokking/currently existing generalization phenomena (e.g. induction heads) to test your theory? Or do you expect the generalizations that would lead to the sharp left turn to be greater/more significant than those that occurred earlier in training?
Thanks for trying! I don't think that's much evidence against GPT-3 being a good oracle though, because to me it's pretty normal that it's not able to forecast without fine-tuning. It would need to be extremely sample efficient to be able to do that. Does anyone want to try fine-tuning?
Cost: You basically have 3 months free with GPT-3 Davinci (175B) (under a given limit, but one that is sufficient for personal use), and then you pay as you go. Even if you use it a lot, you're likely to pay less than $5 or $10 per month.
And if you have some tasks that need a lot of tokens but that are not too hard (e.g. hard reading comprehension), Curie (GPT-3 6B) is often enough and is much cheaper to use!
In few-shot settings (i.e. a setting in which you show examples of something so that the model reproduces it), Curie is often very good, so it's worth trying it...
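If you want to try this, here's a minimal sketch of a few-shot call to Curie using the legacy (pre-1.0) OpenAI Python client; the model name, client version, and the toy sentiment task are my assumptions, and both the API and pricing may have changed since:

```python
import openai  # legacy (pre-1.0) OpenAI Python client

openai.api_key = "YOUR_API_KEY"

# Few-shot prompt: two labeled examples, then the input we want completed.
prompt = (
    "Review: The movie was a waste of time.\nSentiment: negative\n\n"
    "Review: I loved every minute of it.\nSentiment: positive\n\n"
    "Review: The plot dragged, but the acting was great.\nSentiment:"
)

response = openai.Completion.create(
    model="text-curie-001",  # Curie (GPT-3 6B): much cheaper than Davinci
    prompt=prompt,
    max_tokens=5,
    temperature=0.0,  # near-deterministic output for classification-style tasks
)
print(response["choices"][0]["text"].strip())
```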
Are there existing models for which we're pretty sure we know all their latent knowledge? For instance, small language models or something like that.
Thanks for the answer! The post you mentioned is indeed quite similar!
Technically, the strategies I suggested in my last two paragraphs (leverage the fact that we're able to verify solutions to problems we can't solve + give partial information to an algorithm and use more information to verify) should enable us to go far beyond human intelligence / human knowledge using a lot of different narrowly accurate algorithms.
And thus, if the predictor has seen many extremely (narrowly) smart algorithms, it would be much more likely to know what it is like to be...
You said that naive questions were tolerated, so here's a scenario where I can't figure out why it wouldn't work.
It seems to me that when an AI fails to predict the truth (because it predicts as humans would), it is because the AI has built an internal model of how humans understand things and predicts based on that understanding. So if we assume that an AI is able to build such an internal model, why wouldn't we train an AI to predict what a (benevolent) human would say given an amount of information and a capacity to process information? Doing...
I think that "There are many talented people who want to work on AI alignment, but are doing something else instead." is likely to be true. I met at least 2 talented people who tried to get into AI Safety but who weren't able to because open positions / internships were too scarce. One of them at least tried hard (i.e applied for many positions and couldn't find one (scarcity), despite the fact that he was one of the top french students in ML). If there was money / positions, I think that there are chances that he would work on AI alignment independently.
Connor Leahy mentions something similar in one of his podcasts as well.
That's the impression I have.
Cool thanks.
I've seen that you've edited your post. If you look at ASL-3 Containment Measures, I'd recommend considering editing away the "Yay" as well.
This post is a pretty significant case of goalpost-moving.
While my initial understanding was that autonomous replication would be a ceiling, this doc has now made it a floor.
So in other words, this paper proposes to keep navigating beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, an...