This is a special post for quick takes by Bogdan Ionut Cirstea. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Bogdan Ionut Cirstea's Shortform
234 comments

'🚨 The annual report of the US-China Economic and Security Review Commission is now live. 🚨

Its top recommendation is for Congress and the DoD to fund a Manhattan Project-like program to race to AGI.

Buckle up...' 

https://x.com/hamandcheese/status/1858897287268725080

In the reuters article they highlight Jacob Helberg: https://www.reuters.com/technology/artificial-intelligence/us-government-commission-pushes-manhattan-project-style-ai-initiative-2024-11-19/

He seems quite influential in this initiative and recently also wrote this post:

https://republic-journal.com/journal/11-elements-of-american-ai-supremacy/

Wikipedia has the following paragraph on Helberg:

“He grew up in a Jewish family in Europe.[9] Helberg is openly gay.[10] He married American investor Keith Rabois in a 2018 ceremony officiated by Sam Altman.”

Might this be an angle to understand the influence that Sam Altman has on recent developments in the US government?

5Sodium
This chapter on AI follows immediately after the year in review. I went and checked the previous few years' annual reports to see what the comparable chapters were about:
2023: China's Efforts To Subvert Norms and Exploit Open Societies
2022: CCP Decision-Making and Xi Jinping's Centralization Of Authority
2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party's Ambitions and Challenges at its Centennial)
2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China's View of Strategic Competition With the United States)
And this year it's Technology And Consumer Product Opportunities and Risks (Chapter 3: U.S.-China Competition in Emerging Technologies). Reminds me of when Richard Ngo said something along the lines of "We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies"
7Akash
I think we're seeing more interest in AI, but I think interest in "AI in general" and "AI through the lens of great power competition with China" has vastly outpaced interest in "AI safety". (Especially if we're using a narrow definition of AI safety; note that people in DC often use the term "AI safety" to refer to a much broader set of concerns than AGI safety/misalignment concerns.) I do think there's some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there's still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.
4Bogdan Ionut Cirstea
'The report doesn't go into specifics but the idea seems to be to build / commandeer the computing resources to scale to AGI, which could include compelling the private labs to contribute talent and techniques. DX rating is the highest priority DoD procurement standard. It lets DoD compel companies, set their own price, skip the line, and do basically anything else they need to acquire the good in question.' https://x.com/hamandcheese/status/1858902373969564047 
4Bogdan Ionut Cirstea
(screenshot in post from PDF page 39 of https://www.uscc.gov/sites/default/files/2024-11/2024_Annual_Report_to_Congress.pdf)
2Bogdan Ionut Cirstea
'China hawk and influential Trump AI advisor Jacob Helberg asserted to Reuters that “China is racing towards AGI," but I couldn't find any evidence in the report to support that claim.' https://x.com/GarrisonLovely/status/1859022323799699474 

I suspect current approaches probably significantly or even drastically under-elicit automated ML research capabilities.

I'd guess the average cost of producing a decent ML paper is at least $10k (in the West, at least) and probably closer to the $100k's.

In contrast, Sakana's AI scientist cost on average $15/paper and $0.50/review. PaperQA2, which claims superhuman performance at some scientific Q&A and lit review tasks, costs something like $4/query. Other papers with claims of human-range performance on ideation or reviewing also probably have costs of <$10/idea or review.

Even the auto ML R&D benchmarks from METR or UK AISI don't give me at all the vibes of coming anywhere close to e.g. what a 100-person team at OpenAI could accomplish in 1 year, if they tried really hard to automate ML.

A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~$10k in inference costs productively. I suspect the resulting agent would probably not do much better than with $100 of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even in just 3 years' time, give... (read more)
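To make "a scaffold which could use ~$10k in inference costs productively" slightly more concrete, here is a hedged sketch of the simplest such shape - best-of-n sampling against an automatic verifier under an explicit dollar budget. The helper names, prices, and structure are hypothetical placeholders, not a claim about how any existing scaffold works:

```python
def best_of_n_under_budget(task, generate, verify, budget_usd: float,
                           cost_per_sample_usd: float = 1.0):
    """Keep sampling candidate solutions until the budget runs out or the
    verifier is fully satisfied; return the best candidate seen.
    `generate` and `verify` are placeholder callables (e.g. an LLM call
    and a unit-test suite returning a score in [0, 1])."""
    best, best_score = None, float("-inf")
    spent = 0.0
    while spent + cost_per_sample_usd <= budget_usd:
        candidate = generate(task)
        spent += cost_per_sample_usd
        score = verify(task, candidate)   # e.g. fraction of tests passed
        if score > best_score:
            best, best_score = candidate, score
        if score >= 1.0:                  # verifier fully satisfied, stop early
            break
    return best, best_score, spent
```

Even this trivial shape makes the point that a $10k budget buys orders of magnitude more verified attempts than a $100 one; the open question is how much of that translates into better research outputs rather than redundant samples.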


In contrast, Sakana's AI scientist cost on average $15/paper and $0.50/review. 

The Sakana AI stuff is basically total bogus, as I've pointed out on like 4 other threads (and also as Scott Alexander recently pointed out). It does not produce anything close to fully formed scientific papers. Its output is really not better than just prompting o1 yourself. Of course, o1 and even Sonnet and GPT-4 are very impressive, but there is no update to be made after you've played around with that. 

I agree that ML capabilities are under-elicited, but the Sakana AI stuff really is very little evidence on that, besides someone being good at marketing and setting up some scaffolding that produces fake prestige signals.

5Bogdan Ionut Cirstea
(Again) I think this is missing the point that we've now (for the first time, to my knowledge) observed an early demo of the full research workflow being automatable, as flawed as the outputs might be.

I completely agree, and we should just obviously build an organization around this. Automating alignment research while also getting a better grasp on maximum current capabilities (and a better picture of how we expect it to grow).

(This is my intention, and I have had conversations with Bogdan about this, but I figured I'd make it more public in case anyone has funding or ideas they would like to share.)

6Bogdan Ionut Cirstea
Figures 3 and 4 from MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering seem like some amount of evidence for this view:
4Bogdan Ionut Cirstea
Also, there are notable researchers and companies working on developing 'a truly general way of scaling inference compute' right now, and I think it would be prudent to consider what happens if they succeed. (This also has implications for automating AI safety research).
2Bogdan Ionut Cirstea
To spell it out more explicitly, the current way of scaling inference (CoT) seems pretty good vs. some of the most worrying threat models, which often depend on opaque model internals.

Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920 and about how gradient descent is supposed to be nothing like that https://twitter.com/ESYudkowsky/status/1660623900789862401. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent leading to surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.

12/10/24 update: more, and in my view even somewhat methodologically s... (read more)
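For a sense of what the cited evidence concretely looks like: these papers typically fit a simple (often linear) readout from frozen model embeddings to human arousal/valence/aesthetics ratings (or to brain responses), and compare predictivity against the human-to-human noise ceiling. A minimal sketch of that analysis shape, with random arrays standing in for the real stimuli, features, and ratings:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Hypothetical data: embeddings of N images from a frozen vision model,
# and mean human aesthetic ratings for the same images (stand-ins only;
# the cited papers use real stimuli, real raters, and fMRI responses).
rng = np.random.default_rng(0)
N, D = 500, 768
embeddings = rng.normal(size=(N, D))   # stand-in for frozen model features
ratings = rng.normal(size=N)           # stand-in for human ratings

# Cross-validated linear readout; the papers' claim is that with real data
# this predictivity approaches the human-to-human noise ceiling.
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
r2 = cross_val_score(model, embeddings, ratings, cv=5, scoring="r2")
print("cross-validated R^2:", r2.mean())
```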

5AprilSR
hmm. i think you're missing eliezer's point. the idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.
3Bogdan Ionut Cirstea
But my point isn't just that the AI is able to produce similar ratings to humans' for aesthetics, etc., but that it also seems to do so through at least partially overlapping computational mechanisms to humans', as the comparisons to fMRI data suggest.
3AprilSR
I don't think having a beauty-detector that works the same way humans' beauty-detectors do implies that you care about beauty?
4Bogdan Ionut Cirstea
Agree that it doesn't imply caring. But I think, given accumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
2RHollerith
Even if Eliezer's argument in that Twitter thread is completely worthless, it remains the case that "merely hoping" that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as "merely hoping" the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you? Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he considers this Twitter thread to be relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?

Contra both the 'doomers' and the 'optimists' on (not) pausing. Rephrased: RSPs (done right) seem right.

Contra 'doomers'. Oversimplified, 'doomers' (e.g. PauseAI, FLI's letter, Eliezer) ask(ed) for pausing now / even earlier (e.g. the Pause Letter). I expect this would be / have been very much suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests timing the pause so that we can use automated AI safety research, which (under the assumptions in that post) could result in '[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' We clearly don't have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities, would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they're still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.

Contra 'optimists'. I'm more unsure here, but the vibe I'm getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly something like 'no paus... (read more)

5habryka
At least Eliezer has been extremely clear that he is in favor of a stop not a pause (indeed, that was like the headline of his article "Pausing AI Developments Isn't Enough. We Need to Shut it All Down"), so I am confused why you list him with anything related to "pause". My guess is me and Eliezer are both in favor of a pause, but mostly because a pause seems like it would slow down AGI progress, not because the next 6 months in particular will be the most risky period. 
5JBlack
The relevant criterion is not whether the current models are likely to be x-risky (it's obviously far too late if they are!), but whether the next generation of models have more than an insignificant chance of being x-risky together with all the future frameworks they're likely to be embedded into. Given that the next generations are planned to involve at least one order of magnitude more computing power in training (and are already in progress!) and that returns on scaling don't seem to be slowing, I think the total chance of x-risk from those is not insignificant.
2Nathan Helm-Burger
I agree with some points here Bogdan, but not all of them. I do think that current models are civilization-scale-catastrophe-risky (but importantly not x-risky!) from a misuse perspective, but not yet from a self-directed perspective. Which means neither Alignment nor Control are currently civilization-scale-catastrophe-risky, much less x-risky. I also agree that pausing now would be counter-productive. My reasoning for this is that I agree with Samo Burja about some key points which are relevant here (while disagreeing with his conclusions due to other points). To quote myself:

Think about how you'd expect these factors to change if large AI training runs were paused. I think you might agree that this would likely result in a temporary shift of much of the top AI scientist talent to making theoretical progress. They'd want to be ready to come in strong after the pause was ended, with lots of new advances tested at small scale. I think this would actually result in more high-quality scientific thought directed at the heart of the problem of AGI, and thus make AGI very likely to be achieved sooner after the pause ends than it otherwise would have been.

I would go even farther, and make the claim that AGI could arise during a pause on large training runs. I think that the human brain is not a supercomputer; my upper estimate for 'human brain inference' is about at the level of a single 8x A100 server. Less than an 8x H100 server. Also, I have evidence from analysis of the long-range human connectome (long-range axons are called tracts, so perhaps I should call this a 'tractome'). [Hah, I just googled this term I came up with just now, and found it's already in use, and that it brings up some very interesting neuroscience papers. Cool.] Anyway... I was saying, this evidence shows that the range of bandwidth (data throughput in bits per second) between two cortical regions in the human brain is typically around 5 mb/s, and maxes out at about 50 mb/s. In other words,

Hot take, though increasingly moving towards lukewarm: if you want to get a pause/international coordination on powerful AI (which would probably be net good, though likely it would strongly depend on implementation details), arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and 'truer' than arguments about technical intent misalignment and loss of control (especially for not-wildly-superhuman AI).

arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and 'truer'

Say more? 

I think the general impression of people on LW is that multipolar scenarios and concerns over "which monkey finds the radioactive banana and drags it home" are in large part a driver of AI racing instead of being a potential impediment/solution to it. Individuals, companies, and nation-states justifiably believe that whichever one of them accesses potentially superhuman AGI first will have the capacity to flip the gameboard at-will, obtain power over the entire rest of the Earth, and destabilize the currently-existing system. Standard game theory explains the final inferential step for how this leads to full-on racing (see the recent U.S.-China Commission's report for a representative example of how this plays out in practice).

I get that we'd like to all recognize this problem and coordinate globally on finding solutions, by "mak[ing] coordinated steps away from Nash equilibria in lockstep". But I would first need to see an example, a prototype, of how this can play out in practice on an important and highly salient issue.... (read more)
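To spell out the "standard game theory explains the final inferential step" part, here is a toy sketch with made-up payoffs (purely illustrative; not taken from the Commission's report or any cited source) in which racing is each actor's best response no matter what the other does:

```python
# Toy two-actor race game with made-up payoffs, purely to illustrate why
# racing can be the unique equilibrium even when coordinated restraint is
# jointly better. Entries are (actor A payoff, actor B payoff).
payoffs = {
    ("pause", "pause"): (3, 3),   # coordinated restraint
    ("pause", "race"):  (0, 4),   # the one who races gets a decisive advantage
    ("race",  "pause"): (4, 0),
    ("race",  "race"):  (1, 1),   # mutual racing: worse than coordination
}

def best_response(options, their_choice, me):
    """Return `me`'s payoff-maximizing choice given the other actor's choice."""
    def my_payoff(mine):
        key = (mine, their_choice) if me == 0 else (their_choice, mine)
        return payoffs[key][me]
    return max(options, key=my_payoff)

options = ("pause", "race")
for their_choice in options:
    print(f"vs {their_choice!r}: best responses = "
          f"{best_response(options, their_choice, 0)!r}, "
          f"{best_response(options, their_choice, 1)!r}")
# Racing is a best response to either choice, so (race, race) is the unique
# Nash equilibrium despite (pause, pause) being better for both actors.
```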

4Bogdan Ionut Cirstea
At the risk of being overly spicy/unnuanced/uncharitable: I think quite a few MIRI [agent foundations] memes ("which monkey finds the radioactive banana and drags it home", "automating safety is like having the AI do your homework", etc.) seem very lazy/un-truth-tracking and probably net-negative at this point, and I kind of wish they'd just stop propagating them (Eliezer being probably the main culprit here). Perhaps even more spicily, I similarly think that the old MIRI threat model of Consequentialism is looking increasingly 'tired'/un-truth-tracking, and there should be more updating away from it (and more so with every single increase in capabilities without 'proportional' increases in 'Consequentialism'/egregious misalignment).

(Especially) In a world where the first AGIs are not egregiously misaligned, it very likely matters enormously who builds the first AGIs and what they decide to do with them. While this probably creates incentives towards racing in some actors (probably especially the ones with the best chances to lead the race), I suspect better informing more actors (especially more of the non-leading ones, who might especially see themselves as more on the losing side in the case of AGI and potential destabilization) should also create incentives for (attempts at) more caution and coordination, which the leading actors might at least somewhat take into consideration, e.g. for reasons along the lines of https://aiprospects.substack.com/p/paretotopian-goal-alignment.

I'm not particularly optimistic about coordination, especially the more ambitious kinds of plans (e.g. 'shut it all down', long pauses like in 'A narrow path...', etc.), and that's to a large degree (combined with short timelines and personal fit) why I'm focused on automated safety research. I'm just saying: 'if you feel like coordination is the best plan you can come up with/you're most optimistic about, there are probably more legible and likely also more truth-tracking arguments
2Bogdan Ionut Cirstea
(Also, what Thane Ruthenis commented below.)
2Noosphere89
I'd say the big factor that makes AI controllable right now is that the compute necessary to build AI that can do very good AI research to automate R&D (and then the economy) is locked behind TSMC, Nvidia, and ASML, and their processes are both nearly irreplaceable and very expensive to make, so it's way easier to intervene on the chokepoints required for AI development than on gain-of-function research.
1sunwillrise
I agree, but I think this is slightly beside the original points I wanted to make.
8Thane Ruthenis
Agreed. I think a type of "stop AGI research" argument that's under-deployed is that there's no process or actor in the world that society would trust with unilateral godlike power. At large, people don't trust their own governments, don't trust foreign governments, don't trust international organizations, and don't trust corporations or their CEOs. Therefore, preventing anyone from building ASI anywhere is the only thing we can all agree on. I expect this would be much more effective messaging with some demographics, compared to even very down-to-earth arguments about loss of control. For one, it doesn't need to dismiss the very legitimate fear that the AGI would be aligned to values that a given person would consider monstrous. (Unlike "stop thinking about it, we can't align it to any values!".) And it is, of course, true.
4Akash
What kinds of conflicts are you envisioning? I think if the argument is something along the lines of "maybe at some point other countries will demand that the US stop AI progress", then from the perspective of the USG, I think it's sensible to operate under the perspective of "OK so we need to advance AI progress as much as possible and try to hide some of it, and if at some future time other countries are threatening us we need to figure out how to respond." But I don't think it justifies anything like "we should pause or start initiating international agreements." (Separately, whether or not it's "truer" depends a lot on one's models of AGI development. Most notably: (a) how likely is misalignment and (b) how slow will takeoff be//will it be very obvious to other nations that super advanced AI is about to be developed, and (c) how will governments and bureaucracies react and will they be able to react quickly enough.) (Also separately– I do think more people should be thinking about how these international dynamics might play out & if there's anything we can be doing to prepare for them. I just don't think they naturally lead to a "oh, so we should be internationally coordinating" mentality and instead lead to much more of a "we can do whatever we want unless/until other countries get mad at us & we should probably do things more secretly" mentality.)
4Bogdan Ionut Cirstea
I'm envisioning something like: scary powerful capabilities/demos/accidents leading to various/a coalition of other countries asking the US (and/or China) not to build any additional/larger data centers (and/or run any larger training runs), and, if they're scared enough, potentially even threatening various (escalatory) measures, including economic sanctions, blockading the supply of compute/prerequisites to compute, sabotage, direct military strikes on the data centers, etc. I'm far from an expert on the topic, but I suspect it might not be trivial to hide at least building a lot more new data centers/supplying a lot more compute, if a significant chunk of the rest of the world was watching very intently. I'm envisioning a very near-casted scenario, on very short (e.g. Daniel Kokotajlo-cluster) timelines, egregious misalignment quite unlikely but not impossible, slow-ish (couple of years) takeoff (by default, if no deliberate pause), pretty multipolar, but with more-obviously-close-to-scary capabilities, like ML R&D automation evals starting to fall.
4Akash
Thanks for spelling it out. I agree that more people should think about these scenarios. I could see something like this triggering central international coordination (or conflict). (I still don't think this would trigger the USG to take different actions in the near-term, except perhaps "try to be more secret about AGI development" and maybe "commission someone to do some sort of study or analysis on how we would handle these kinds of dynamics & what sorts of international proposals would advance US interests while preventing major conflict." The second thing is a bit optimistic but maybe plausible.)

Quick take on o1: overall, it's been a pretty good day. Likely still sub-ASL-3, (opaque) scheming still seems very unlikely because the prerequisites still don't seem there. CoT-style inference compute playing a prominent role in the capability gains is pretty good for safety, because differentially transparent. Gains on math and code suggest these models are getting closer to being usable for automated safety research (also for automated capabilities research, unfortunately).

6Vladimir_Nesov
CoT inference looks more like the training surface, not essential part of resulting cognition after we take one more step following such models. Orion is reportedly (being) pretrained on these reasoning traces, and if it's on the order of 50 trillion tokens, that's about as much as there is natural text data of tolerable quality in the world available for training. Contrary to the phrasing, what transformers predict is in part distant future tokens within a context, not proximate "next tokens" that follow immediately after whatever the prediction must be based on. So training on reasoning traces should teach the models concepts that let them arrive at the answer faster, skipping the avoidable parts of the traces and compressing a lot of the rest into less scrutable activations. The models trained at the next level of scale might be quite good at that, to the extent not yet known from experience with the merely GPT-4 scale models.
5Noosphere89
Some bad news is that there was some problematic power seeking and instrumental convergence, though thankfully that only happened in an earlier model: https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1#JGwizteTkCrYB5pPb Edit: It looks like the instrumentally convergent reasoning was because of the prompt, so I roll back my updates on instrumental convergence being likely: https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9

QwQ-32B-Preview was released open-weights, seems comparable to o1-preview. Unless they're gaming the benchmarks, I find it both pretty impressive and quite shocking that a 32B model can achieve this level of performance. Seems like great news vs. opaque (e.g. in-one-forward-pass) reasoning. Less good with respect to proliferation (there don't seem to be any [deep] algorithmic secrets), misuse, and short timelines.

6Vladimir_Nesov
From proliferation perspective, it reduces overhang, makes it more likely that Llama 4 gets long reasoning trace post-training in-house rather than later, and so initial capability evaluations give more relevant results. But if Llama 4 is already training, there might not be enough time for the technique to mature, and Llamas have been quite conservative in their techniques so far.
1mrtreasure
There have been comments from OAI staff that o1 is "GPT-2 level" so I wonder if it's a similar size?
9ShardPhoenix
I think they meant that as an analogy to how developed/sophisticated it was (ie they're saying that it's still early days for reasoning models and to expect rapid improvement), not that the underlying model size is similar.

IIRC OAers also said somewhere (doesn't seem to be in the blog post, so maybe this was on Twitter?) that o1 or o1-preview was initialized from a GPT-4 (a GPT-4o?), so that would also rule out a literal parameter-size interpretation (unless OA has really brewed up some small models).

1Lee_0505
There was an article about it before the release. https://archive.is/IwKSP
3gwern
(Relevant, although "involving its GPT-4 AI model" is a considerably weaker statement than 'initialized from a GPT-4 checkpoint'.)

Like transformers, SSMs like Mamba also have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:

'We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.'
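To make "weak single forward passes" concrete: the canonical hard state-tracking problem in this line of work is composing a sequence of permutations (the S5 word problem), which is trivial to solve sequentially but out of reach for a single constant-depth, parallelizable forward pass. A minimal toy generator for the task (my own sketch, not the papers' code):

```python
import random
from itertools import permutations

S5 = list(permutations(range(5)))  # the 120 elements of the symmetric group S5

def compose(p, q):
    """Apply permutation q after p (both as tuples mapping index -> value)."""
    return tuple(q[p[i]] for i in range(5))

def sample_word_problem(length: int):
    """Return a sequence of permutations (the input) and the running
    composition after each step (the state a model must track)."""
    word = [random.choice(S5) for _ in range(length)]
    state = tuple(range(5))  # identity permutation
    states = []
    for p in word:
        state = compose(state, p)
        states.append(state)
    return word, states

word, states = sample_word_problem(16)
# The final group element: easy to compute sequentially, provably hard to
# compute in one constant-depth parallel pass (it is NC^1-hard).
print(states[-1])
```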

7ryan_greenblatt
Surely fundamentally at odds? You can't spend a while thinking without spending a while thinking. Of course, the lunch still might be very cheap by only spending a while thinking a fraction of the time or whatever.

'Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.' https://epochai.org/blog/data-movement-bottlenecks-scaling-past-1e28-flop 

The post argues that there is a latency limit at 2e31 FLOP, and I've found it useful to put this scale into perspective. 

Current public models such as Llama 3 405B are estimated to be trained with ~4e25 FLOP, so such a model would require 500,000x more compute. Since Llama 3 405B was trained with 16,000 H-100 GPUs, the model would require 8 billion H-100 GPU equivalents, at a cost of $320 trillion with H-100 pricing (or ~$100 trillion if we use B-200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled out by another factor: the 2e31 limit assumes a training time of only 3 months. If we were to build such a system over several years and had the patience to wait an additional 3 years for the training run to complete, this pushes the latency limit out by another order of magnitude. So at the point where we are bound by the latency limit, we are either investing a significant percentage of world GDP into the project, or we have already reached ASI at a smaller scale of compute and are using it to dramatically reduce compute costs for successor models. 
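A back-of-the-envelope version of that arithmetic (the ~$40k-per-H100 price is my assumption, chosen to roughly reproduce the quoted $320 trillion figure; the other inputs are from the comment above):

```python
# Rough check of the scale implied by the 2e31 FLOP latency wall.
# Assumptions (mine, not from the Epoch post): ~$40k per H100, and GPU count
# scaling linearly with training compute at fixed (3-month) training time.

latency_wall_flop = 2e31          # latency limit from the Epoch post
llama3_405b_flop = 4e25           # estimated Llama 3 405B training compute
llama3_405b_gpus = 16_000         # H100s used for Llama 3 405B
h100_price_usd = 40_000           # assumed price per H100

compute_ratio = latency_wall_flop / llama3_405b_flop        # ~5e5
gpu_equivalents = llama3_405b_gpus * compute_ratio          # ~8e9 H100s
hardware_cost = gpu_equivalents * h100_price_usd            # ~$3.2e14

print(f"compute ratio:    {compute_ratio:.1e}x")            # ~5.0e+05x
print(f"H100 equivalents: {gpu_equivalents:.1e}")           # ~8.0e+09
print(f"hardware cost:    ${hardware_cost:.1e}")            # ~$3.2e+14 (~$320T)
```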

Of course none of this analysis applies to the earlier data movement limit of 2e28 FLOP, which I think is more relevant and interesting. 

6anaguma
An important caveat to the data movement limit: “A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.” I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data wall is at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW power required) before we hit the data movement limit.
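For illustration (plugging in a hypothetical dataset size; the number is mine, not Epoch's or Zhang et al.'s), the quoted batch-size fit grows quickly with data:

$$
B = 17.75\,D^{0.47} \quad\Rightarrow\quad B\big|_{D = 2\times 10^{13}\ \text{tokens}} \approx 17.75 \times (2\times 10^{13})^{0.47} \approx 3\times 10^{7}\ \text{tokens},
$$

i.e. a critical batch size on the order of tens of millions of tokens at that dataset scale, which is what pushes the utilization fall out by the quoted two orders of magnitude.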
3Bogdan Ionut Cirstea
Speculatively, this might also differentially incentivize (research on generalized) inference scaling, with various potential strategic implications, including for AI safety (current inference scaling methods tend to be tied to CoT and the like, which are quite transparent) and for regulatory frameworks/proliferation of dangerous capabilities.
6rotatingpaguro
Aschenbrenner in Situational Awareness predicts illegible chains of thought are going to prevail because they are more efficient. I know of one developer claiming to do this (https://platonicresearch.com/) but I guess there must be many.
3Bogdan Ionut Cirstea
under the assumptions here (including Chinchilla scaling laws), depth wouldn't increase by more than about 3x before the utilization rate starts dropping (because depth would increase with exponent about 1/6 of the total increase in FLOP); which seems like great news for the legibility of CoT outputs and similar and vs. opaque reasoning in models: https://lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
1anaguma
Interesting paper, though the estimates here don’t seem to account for Epoch’s correction to the chinchilla scaling laws: https://epochai.org/blog/chinchilla-scaling-a-replication-attempt This would imply that the data movement bottleneck is a bit further out.

Success at currently-researched generalized inference scaling laws might risk jeopardizing some of the fundamental assumptions of current regulatory frameworks.

  • the o1 results have illustrated specialized inference scaling laws for model capabilities in some specialized domains (e.g. math); notably, these don't seem to hold generally across all domains - e.g. o1 doesn't seem better than GPT-4o at writing;
  • there's ongoing work at OpenAI to make generalized inference scaling work;
  • e.g. this could perhaps (though maybe somewhat overambitiously) be framed, in the language of https://epochai.org/blog/trading-off-compute-in-training-and-inference, as there no longer being an upper bound on how many OOMs of inference compute can be traded for equivalent OOMs of pretraining compute;
  • to the best of my awareness, current regulatory frameworks/proposals (e.g. the EU AI Act, the Executive Order, SB 1047) frame the capabilities of models in terms of (pre)training compute and maybe fine-tuning compute (e.g. if (pre)training FLOP > 1e26, the developer needs to take various measures), without any similar requirements framed in terms of inference compute; so current regulatory frameworks seem unprepared f
... (read more)
2Bogdan Ionut Cirstea
For similar reasons to the discussion here about why individuals and small businesses might be expected to be able to (differentially) contribute to LM scaffolding (research), I expect them to be able to differentially contribute to [generalized] inference scaling [research]; plausibly also the case for automated ML research agents. Also relevant: Before smart AI, there will be many mediocre or specialized AIs. 

Quick take: on the margin, a lot more research should probably be going into trying to produce benchmarks/datasets/evaluation methods for safety (than more directly doing object-level safety research). 

Some past examples I find valuable - in the case of unlearning: WMDP, Eight Methods to Evaluate Robust Unlearning in LLMs; in the case of mech interp - various proxies for SAE performance, e.g. from Scaling and evaluating sparse autoencoders, as well as various benchmarks, e.g. FIND: A Function Description Benchmark for Evaluating Interpretability Methods. Prizes and RFPs seem like a potentially scalable way to do this - e.g. https://www.mlsafety.org/safebench - and I think they could be particularly useful on short timelines.

Including differentially vs. doing the full stack of AI safety work - because I expect a lot of such work could be done by automated safety researchers soon, with the right benchmarks/datasets/evaluation methods. This could also make it much easier to evaluate the work of automated safety researchers and to potentially reduce the need for human feedback, which could be a significant bottleneck.

Better proxies could also make it easier to productively deploy ... (read more)

4Raemon
I'm torn between generally being really fucking into improving feedback loops (and thinking they are a good way to make it easier to make progress on confusing questions), and, being sad that so few people are actually just trying to actually directly tackle the hard bits of the alignment challenge.
4Bogdan Ionut Cirstea
Some quick thoughts:
* automating research that 'tackles the hard bits' seems likely to be harder (happen chronologically later) and might be more bottlenecked by good (e.g. agent foundations) human researcher feedback - which does suggest it might be valuable to recruit more agent foundations researchers today
* but my impression is that recruiting/forming agent foundations researchers itself is much harder and has much worse feedback loops
* I expect that if the feedback loops are good enough, the automated safety researchers could make enough prosaic safety progress, feeding into agendas like control and/or conditioning predictive models (+ extensions to e.g. steering using model internals, + automated reviewing, etc.), to allow the use of automated agent foundations researchers with more confidence and less human feedback
* the hard bits might also become much less hard - or prove to be irrelevant - if for at least some of them empirical feedback could be obtained by practicing on much more powerful models
Overall, my impression is that focusing on the feedback loops seems significantly more scalable today, and the results would probably be enough to allow mostly deferring the hard bits to automated research. (Edit: but also, I think if funders and field builders 'went hard', it might be much less necessary to choose.) 
2Bogdan Ionut Cirstea
Potentially also https://www.lesswrong.com/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking. 

I think this one (and perhaps better operationalizations) should probably have many eyes on it: 

4niplav
I also have this market for GPQA, on a longer time-horizon: https://manifold.markets/NiplavYushtun/will-the-gap-between-openweights-an

Jack Clark: 'Registering a prediction: I predict that within two years (by July 2026) we'll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we'll see the same thing - an AI system beating all humans in a known-hard competition - in another scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.' https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter 

8Jacob Pfau
Prediction markets on similar questions suggest to me that this is a consensus view.
* General LLMs at 44% to get gold on the IMO before 2026. This suggests the mathematical competency will be transferable - not just restricted to domain-specific solvers.
* LLMs favored to outperform PhD students in their own subject before 2026.
With research automation in mind, here's my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the same difficulty/diversity/open-endedness distribution that PhDs would have posed in early 2024.
5Bogdan Ionut Cirstea
Fwiw, I've kind of already noted myself starting to do some of this, for AI safety-related papers; especially after Claude-3.5 Sonnet came out.

(cross-posted from https://x.com/BogdanIonutCir2/status/1844451247925100551, among others)

I'm concerned things might move much faster than most people expected, because of automated ML (figure from OpenAI's recent MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, basically showing automated ML engineering performance scaling as more inference compute is used):

https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=dyndDEn9qqdt9Mhx2 

https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogda... (read more)

I find it pretty wild that automating AI safety R&D, which seems to me like the best shot we currently have at solving the full superintelligence control/alignment problem, no longer seems to have any well-resourced, vocal, public backers (with the superalignment team disbanded).

mishka

I think Anthropic is becoming this org. Jan Leike just tweeted:

https://x.com/janleike/status/1795497960509448617

I'm excited to join @AnthropicAI to continue the superalignment mission!

My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.

If you're interested in joining, my dms are open.

5Carl Feynman
On what basis do you think it's the 'best shot'? I used to think it was a good idea, a few years ago, but in retrospect I think that was just a computer scientist's love of recursion. I don't think that, at present, conditions are good for automating R&D. On the one hand, we have a lot of very smart people working on AI safety R&D, with very slow progress, indicating it is a hard problem. On the other hand, present-day LLMs are stupid at long-term planning and at acquiring new knowledge, which are things you need to be good at to do R&D. What advantage do you see AIs having over humans in this area?
7Nathan Helm-Burger
I think there will be a period in the future where AI systems (models and their scaffolding) exist which are sufficiently capable that they will be able to speed up many aspects of computer-based R&D. Including recursive-self-improvement, Alignment research and Control research. Obviously, such a time period will not be likely to last long given that surely some greedy actor will pursue RSI. So personally, that's why I'm not putting a lot of faith in getting to that period [edit: resulting in safety]. I think that if you build the scaffolding which would make current models able to be substantially helpful at research (which would be impressively strong scaffolding indeed!), then you have built dual-use scaffolding which could also be useful for RSI. So any plans to do this must take appropriate security measures or they will be net harmful.
2Bogdan Ionut Cirstea
Agree with a lot of this, but scaffolds still seem to me pretty good, for reasons largely similar to those in https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model#Accelerating_LM_agents_seems_neutral__or_maybe_positive_.  
4Bogdan Ionut Cirstea
I kind of wish this was true, because it would likely mean longer timelines, but my expectation is that the incoming larger LMs + better scaffolds + more inference-time compute could quite easily pass the threshold of significant algorithmic progress speedup from automation (e.g. 2x).
5Jacob Pfau
Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation. Algorithmic progress can use:
1. high iteration speed
2. well-defined success metrics (scaling laws)
3. broad knowledge of the whole stack (CUDA to optimization theory to test-time scaffolds)
4. ...
Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering-heavy alignment sub-tasks like jailbreak resistance, and some interp work, this isn't true). Probably automated AI scientists will be applied to alignment research, but unfortunately automated research will differentially accelerate algorithmic progress over alignment. This line of reasoning is part of why I think it's valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.
3Bogdan Ionut Cirstea
Generally agree, but I do think prosaic alignment has quite a few advantages vs. prosaic capabilities (e.g. in the extra slides here) and this could be enough to result in aligned (-enough) automated safety researchers which can be applied to the more blue skies parts of safety research. I would also very much prefer something like a coordinated pause around the time when safety research gets automated. Agree, I've written about (something related to) this very recently.
4Carl Feynman
Yes, things have certainly changed in the four months since I wrote my original comment, with the advent of o1 and Sakana’s Artificial Scientist.  Both of those are still incapable of full automation of self-improvement, but they’re close.  We’re clearly much closer to a recursive speed up of R&D, leading to FOOM.

from https://jack-clark.net/2024/08/18/import-ai-383-automated-ai-scientists-cyborg-jellyfish-what-it-takes-to-run-a-cluster/…, commenting on https://arxiv.org/abs/2408.06292: 'Why this matters – the taste of automated science: This paper gives us a taste of a future where powerful AI systems propose their own ideas, use tools to do scientific experiments, and generate results. At this stage, what we have here is basically a ‘toy example’ with papers of dubious quality and insights of dubious import. But you know where we were with language models five yea... (read more)

-4Seth Herd
Yep. I find it pretty odd that alignment people say, with a straight face, "LLM agents are still pretty dumb so I don't think that's a path to AGI". I thought the whole field was about predicting progress and thinking out ahead of it. Progress does tend to happen when humans work on something. Progress probably happens when parahumans work on things, too.
8habryka
Who says those things? That doesn't really sound like something that people say. Like, I think there are real arguments about why LLM agents might not be the most likely path to AGI, but "they are still pretty dumb, therefore that's not a path to AGI" seems like obviously a strawman, and I don't think I've ever seen it (or at least not within the last 4 years or so).
2Seth Herd
Fair enough; this sentiment is only mentioned offhand in comments and might not capture very much of the average opinion. I may be misestimating the community's average opinion. I hope I am wrong, and I'm glad to see others don't agree! I'm a bit puzzled on the average attitude toward LLM agents as a route to AGI among alignment workers. I'm still surprised there aren't more people working directly on aligning LLM agents. I'd think we'd be working harder on the most likely single type of first AGI if lots of us really believed it's reasonably likely to get there (especially since it's probably the fastest route if it doesn't plateau soon). One possible answer is that people are hoping that the large amount of work on aligning LLMs will cover aligning LLM agents. I think that work is helpful but not sufficient, so we need more thinking about agent alignment as distinct from their base LLMs. I'm currently writing a post on this.
2habryka
(Most people in AI Alignment work at scaling labs and are therefore almost exclusively working on LLM alignment. That said, I don't actually know what it means to work on LLM alignment over aligning other systems, it's not like we have a ton of traction on LLM alignment, and most techniques and insights seem general enough to not be conditional specifically on LLMs)
6Steven Byrnes
I think Seth is distinguishing “aligning LLM agents” from “aligning LLMs”, and complaining that there’s insufficient work on the former, compared to the latter? I could be wrong. Ooh, I can speak to this. I’m mostly focused on technical alignment for actor-critic model-based RL systems (a big category including MuZero and [I argue] human brains). And FWIW my experience is: there are tons of papers & posts on alignment that assume LLMs, and with rare exceptions I find them useless for the non-LLM algorithms that I’m thinking about. As a typical example, I didn’t get anything useful out of Alignment Implications of LLM Successes: a Debate in One Act—it’s addressing a debate that I see as inapplicable to the types of AI algorithms that I’m thinking about. Ditto for the debate on chain-of-thought accuracy vs steganography and a zillion other things. When we get outside technical alignment to things like “AI control”, governance, takeoff speed, timelines, etc., I find that the assumption of LLMs is likewise pervasive, load-bearing, and often unnoticed. I complain about this from time to time, for example Section 4.2 here, and also briefly here (the bullets near the bottom after “Yeah some examples would be:”).
2Noosphere89
I agree with the claim that the techniques and insights for alignment that are usually considered are not conditional on LLMs specifically, including my own plan for AI alignment.

I recently gave a talk (slides) on some thoughts about what automating AI safety research might look like. 

Some [earlier versions] of the ideas there were developed during my Astra Fellowship Winter '24 with @evhub and through related conversations in Constellation.

On an apparent missing mood - FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely 

Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post): 

each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.

Despite this promise, we seem not to have much knowledge of when such automated AI safety R&D might happ... (read more)

7ryan_greenblatt
My main vibe is:
* AI R&D and AI safety R&D will almost surely come at the same time.
* Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work).
* People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
1Bogdan Ionut Cirstea
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals). This seems good w.r.t. automated AI safety potentially 'piggybacking', but bad for differential progress. Sure, though wouldn't this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time? 
2ryan_greenblatt
Why think this is important to measure or that this already isn't happening? E.g., on the current model organism related project I'm working on, I automate inspecting reasoning traces in various ways. But I don't feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn't more important than other tips for doing LLM research better).
3Bogdan Ionut Cirstea
Intuitively, I'm thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this 'race', corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence). W.r.t. measurement, I think it would be good orthogonally to whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, the information of how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
1Bogdan Ionut Cirstea
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:

If this generalizes, OpenAI's Orion, rumored to be trained on synthetic data produced by O1, might see significant gains not just in STEM domains, but more broadly - from O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?:

'this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on simply tens of thousands of sampl... (read more)
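A minimal sketch of the pipeline shape the quoted study describes - sample long reasoning traces from the teacher, keep only those whose extracted final answer matches a reference, then fine-tune a smaller base model on the survivors. The `query_teacher` and `final_answer` helpers below are hypothetical placeholders, not the paper's code or any particular API:

```python
from dataclasses import dataclass

@dataclass
class Example:
    problem: str
    reference_answer: str

def query_teacher(problem: str) -> str:
    """Placeholder: call the teacher model and return its full long-form
    reasoning trace plus final answer."""
    raise NotImplementedError

def final_answer(trace: str) -> str:
    """Placeholder: extract the terminal/boxed answer from a trace."""
    raise NotImplementedError

def build_sft_dataset(problems: list[Example], samples_per_problem: int = 4):
    """Keep only traces whose extracted answer matches the reference,
    yielding (prompt, completion) pairs for supervised fine-tuning."""
    dataset = []
    for ex in problems:
        for _ in range(samples_per_problem):
            trace = query_teacher(ex.problem)
            if final_answer(trace) == ex.reference_answer:
                dataset.append({"prompt": ex.problem, "completion": trace})
    return dataset
```

The notable claim in the quote is just how little of this filtered data (tens of thousands of samples) is needed for the student to match or exceed the teacher on some reasoning benchmarks.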

(crossposted from X/twitter)

Epoch is one of my favorite orgs, but I expect many of the predictions in https://epochai.org/blog/interviewing-ai-researchers-on-automation-of-ai-rnd to be overconservative / too pessimistic. I expect roughly a similar scaleup in terms of compute as https://x.com/peterwildeford/status/1825614599623782490… - training runs ~1000x larger than GPT-4's in the next 3 years - and massive progress in both coding and math (e.g. along the lines of the medians in https://metaculus.com/questions/6728/ai-wins-imo-gold-medal/… https://metacu... (read more)

(cross-posted from X/twitter)

The already-feasibility of https://sakana.ai/ai-scientist/ (with basically non-x-risky systems, sub-ASL-3, and bad at situational awareness so very unlikely to be scheming) has updated me significantly on the tractability of the alignment / control problem. More than ever, I expect it's gonna be relatively tractable (if done competently and carefully) to safely, iteratively automate parts of AI safety research, all the way up to roughly human-level automated safety research (using LLM agents roughly-shaped like the AI scientist... (read more)

habryka

I think currently approximately no one is working on the kind of safety research that when scaled up would actually help with aligning substantially smarter than human agents, so I am skeptical that the people at labs could automate that kind of work (given that they are basically doing none of it). I find myself frustrated with people talking about automating safety research, when as far as I can tell we have made no progress on the relevant kind of work in the last ~5 years.

2Bogdan Ionut Cirstea
Can you give some examples of work which you do think represents progress? My rough mental model is something like: LLMs are trained using very large-scale (multi-task) behavior cloning, so I expect various tasks to be increasingly automated / various data distributions to be successfully learned, all the way (in the limit) to roughly everything humans can do, as a lower bound; including e.g. the distribution of what tends to get posted on LessWrong / the Alignment Forum (and agent foundations-like work); especially when LLMs are scaffolded into agents, given access to tools, etc. With the caveats that the LLM [agent] paradigm might not get there before e.g. it runs out of data / compute, or that the systems might be unsafe to use; but against this, https://sakana.ai/ai-scientist/ was a particularly significant update for me; and 'iteratively automate parts of AI safety research' is also supposed to help with keeping systems safe as they become increasingly powerful.

Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as model of shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we'll be able to 'transmit our thoughts', including alignment-relevant concepts (and they'll also be represented in a [partially overlapping] human-like way).

2Nathan Helm-Burger
I think the Corrigibility agenda, framed as "do what I mean, such that I will probably approve of the consequences, not just what I literally say such that our interaction will likely harm my goals" is more doable than some have made it out to be. I still think that there are sufficient subtle gotchas there that it makes sense to treat it as an area for careful study rather than "solved by default, no need to worry".

Prototype of LLM agents automating the full AI research workflow: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.

And already some potential AI safety issues: 'We have noticed that The AI Scientist occasionally tries to increase its chance of success, such as modifying and launching its own execution script! We discuss the AI safety implications in our paper.

For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself. In another case, its experiments took too ... (read more)
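The failure mode described above (the agent editing and relaunching its own runner, leading to endless self-invocation) is the sort of thing the paper suggests mitigating with sandboxing; below is a minimal sketch of one such guard - running generated experiment code in a child process with a hard timeout. The specifics are my own illustration, not the AI Scientist's actual setup:

```python
import subprocess
import sys

def run_generated_experiment(script_path: str, timeout_s: int = 600) -> int:
    """Run agent-written code in a child process with a hard wall-clock limit,
    so a script that loops or relaunches itself gets cut off instead of taking
    over the machine. (A real setup would add containerization, resource
    limits, and restricted filesystem/network access on top of this.)"""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode
    except subprocess.TimeoutExpired:
        print(f"{script_path} exceeded {timeout_s}s and was terminated")
        return -1
```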

5Bogdan Ionut Cirstea
Quick take: I think LM agents to automate large chunks of prosaic alignment research should probably become the main focus of AI safety funding / person-time. I can't think of any better spent marginal funding / effort at this time.

RSPs for automated AI safety R&D require rethinking RSPs

AFAICT, all current RSPs are only framed negatively, in terms of [prerequisites to] dangerous capabilities to be detected (early) and mitigated. 

In contrast, RSPs for automated AI safety R&D will likely require measuring [prerequisites to] capabilities for automating [parts of] AI safety R&D, and preferentially (safely) pushing these forward. An early such example might be safely automating some parts of mechanistic interpretability.

(Related: On an apparent missing mood - FOMO on all ... (read more)

A brief list of resources with theoretical results which seem to imply RL is much more (e.g. sample efficiency-wise) difficult than IL - imitation learning (I don't feel like I have enough theoretical RL expertise or time to scrutinize hard the arguments, but would love for others to pitch in). Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like: 

Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: 'This work shows that, from the statisti... (read more)

5ryan_greenblatt
IL = imitation learning.
4ryan_greenblatt
I'd bet against any of this providing interesting evidence beyond basic first principles arguments. These types of theory results never seem to add value on top of careful reasoning from my experience.
1Bogdan Ionut Cirstea
Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun's cake); of course, this may be an unfair/biased comparison using hindsight. I'm unsure how much theoretical results influenced the famous deep learning people (and e.g. classic learning theory results would probably have been misleading), but it doesn't seem obvious they had zero influence. For example, Bengio has multiple at least somewhat conceptual / theoretical (including review) papers motivating deep/representation learning, e.g. Representation Learning: A Review and New Perspectives.
4ryan_greenblatt
I think Paul looks considerably better in retrospect than famous DL people IMO. (Partially via being somewhat more specific, though still not really making predictions.) I'm skeptical hard theory had much influence on anyone though. (In this domain at least.)
1Bogdan Ionut Cirstea
Some more (somewhat) related papers: Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity ('We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.'). On Representation Complexity of Model-based and Model-free Reinforcement Learning ('We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory

quick take: Against Almost Every Theory of Impact of Interpretability should be required reading for ~anyone starting in AI safety (e.g. it should be in the AGISF curriculum), especially if they're considering any model internals work (and of course even more so if they're specifically considering mech interp)

56% on SWE-bench Lite with repeated sampling (13% above the previous SOTA; up from 15.9% with one sample to 56% with 250 samples), using a well-below-SOTA model (https://arxiv.org/abs/2407.21787); anything automatically verifiable (large chunks of math and coding) seems like it's gonna be automatable in < 5 years.
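A minimal sketch of why automatic verifiability does so much work here (hypothetical `generate_candidate` / `run_tests` helpers; the paper's actual pipeline differs): with a cheap verifier you can just keep sampling until something passes, which is where the 15.9% → 56% jump comes from.

```python
# Sketch of repeated sampling against an automatic verifier
# (generate_candidate and run_tests are hypothetical stand-ins,
# not the paper's actual pipeline).
def solve_with_repeated_sampling(task, generate_candidate, run_tests, k=250):
    """Sample up to k candidates; return the first one that passes verification."""
    for _ in range(k):
        candidate = generate_candidate(task)   # one LLM sample, e.g. a code patch
        if run_tests(task, candidate):         # cheap automatic check, e.g. unit tests
            return candidate
    return None  # all k samples failed verification
```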

5Bogdan Ionut Cirstea
The finding on the differential importance of verifiability also seems in line with the findings from Trading Off Compute in Training and Inference.  

(epistemic status: quick take, as the post category says)

Browsing through EAG London attendees' profiles and seeing what seems like way too many people / orgs doing (I assume dangerous capabilities) evals. I expect a huge 'market downturn' on this, since I can hardly see how there would be so much demand for dangerous capabilities evals in a couple of years' time / once some well-known orgs like the AISIs build their sets, which many others will probably copy.

While at the same time other kinds of evals (e.g. alignment, automated AI safety R&D, even control) seem wildly neglected.

6gwern
There's still lots and lots of demand for regular capability evaluation, as we keep discovering new issues or LLMs keep tearing through them and rendering them moot, and the cost of creating a meaningful dataset like GPQA keeps skyrocketing (like ~$1000/item) compared to the old days where you could casually Turk your way to questions LLM would fail (like <$1/item). Why think that the dangerous subset would be any different? You think someone is going to come out with a dangerous-capabilities eval in the next year and then that's it, it's done, we've solved dangerous-capabilities eval, Mission Accomplished?
3Bogdan Ionut Cirstea
If it's well designed and kept private, this doesn't seem totally implausible to me; e.g. how many ways can you evaluate cyber capabilities to try to assess risks of weight exfiltration or taking over the datacenter (in a control framework)? Surely that's not an infinite set. But in any case, it seems pretty obvious that the returns should diminish quickly on e.g. the 100th set of DC evals vs. e.g. the 2nd set of alignment evals / 1st set of auto AI safety R&D evals.
8gwern
It's not an infinite set and returns diminish, but that's true of regular capabilities too, no? And you can totally imagine new kinds of dangerous capabilities; every time LLMs gain a new modality or data source, they get a new set of vulnerabilities/dangers. For example, once Sydney went live, you had a whole new kind of dangerous capability in terms of persisting knowledge/attitudes across episodes of unrelated users by generating transcripts which would become available by Bing Search. This would have been difficult to test before, and no one experimented with it AFAIK. But after seeing indirect prompt injections in the wild and possible amplification of the Sydney personae, now suddenly people start caring about this once-theoretical possibility and might start evaluating it. (This is also a reason why returns don't diminish as much, because benchmarks 'rot': quite aside from intrinsic temporal drift and ceiling issues and new areas of dangers opening up, there's leakage, which is just as relevant to dangerous capabilities as regular capabilities - OK, you RLHFed your model to not provide methamphetamine recipes and this has become a standard for release, but it only works on meth recipes and not other recipes because no one in your org actually cares and did only the minimum RLHF to pass the eval and provide the execs with excuses, and you used off-the-shelf preference datasets designed to do the minimum, like Facebook releasing Llama... Even if it's not leakage in the most literal sense of memorizing the exact wording of a question, there's still 'meta-leakage' of overfitting to that sort of question.)
2Nathan Helm-Burger
To expand on the point of benchmark rot, as someone working on dangerous capabilities evals...  For biorisk specifically, one of the key things to eval is if the models can correctly guess the results of unpublished research. As in, can it come up with plausible hypotheses, accurately describe how to test those hypotheses, and make a reasonable guess at the most likely outcomes? Can it do these things at expert human level? superhuman level? The trouble with this is that the frontier of published research keeps moving forward, so evals like this go out of date quickly. Nevertheless, such evals can be very important in shaping the decisions of governments and corporations. I do agree that we shouldn't put all our focus on dangerous capabilities evals at the expense of putting no focus on other kinds of evals (e.g. alignment, automated AI safety R&D, even control). However, I think a key point is that the models are dangerous NOW. Alignment, safety R&D, and control are, in some sense, future problems. Misuse is a present and growing danger, getting noticeably worse with every passing month. A single terrorist or terrorist org could wipe out human civilization today, killing >90% of the population with less than $500k funding (potentially much less if they have access to a well-equipped lab, and clever excuses ready for ordering suspicious supplies). We have no sufficient defenses. This seems like an urgent and tractable problem. Urgent, because the ceiling on uplift is very far away. Models have the potential to make things much much worse than they currently are.  Tractable, because there are relatively cheap actions that governments could take to slow this increase of risk if they believed in the risks. For what it's worth, I try to spend some of my time thinking about these other types of evals also. And I would recommend that those working on dangerous capabilities evals spend at least a little time and thought on the other problems.   Another aspect of the
2Chris_Leong
What's DC?
2Bogdan Ionut Cirstea
*dangerous capabilities; will edit