
Bogdan Ionut Cirstea

Automated / strongly-augmented safety research.

Posts

Bogdan Ionut Cirstea's Shortform (2y, 244 comments)

Comments
Bogdan Ionut Cirstea's Shortform
Bogdan Ionut Cirstea, 9mo

I suspect current approaches probably significantly or even drastically under-elicit automated ML research capabilities.

I'd guess the average cost of producing a decent ML paper (in the West, at least) is at least $10k, and probably closer to hundreds of thousands of dollars.

In contrast, Sakana's AI scientist costs on average $15/paper and $0.50/review. PaperQA2, which claims superhuman performance at some scientific Q&A and lit review tasks, costs something like $4/query. Other papers claiming human-range performance on ideation or reviewing also probably have costs of <$10/idea or review.

Even the auto ML R&D benchmarks from METR or UK AISI don't give me the impression of coming anywhere close to, e.g., what a 100-person team at OpenAI could accomplish in 1 year if they tried really hard to automate ML.

A fairer comparison would probably be to actually try hard at building the kind of scaffold that could use ~$10k of inference productively. I suspect the resulting agent probably wouldn't do much better than one running on $100 of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even just 3 years from now, given that pretraining compute seems likely to grow about 10x/year and that there might be stronger pushes towards automated ML.
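As a rough back-of-envelope sketch, using the ballpark figures above (all of them guesses for illustration, not measurements):

```python
# Back-of-envelope comparison of human vs. current automated ML-research costs,
# using the rough figures quoted above (all values are illustrative assumptions).

human_cost_per_paper = 100_000    # $ per decent ML paper (upper ballpark)
sakana_cost_per_paper = 15        # $ per AI-Scientist paper
serious_scaffold_budget = 10_000  # $ of inference a serious elicitation attempt might use
typical_inference_spend = 100     # $ of inference in current evaluations

cost_gap = human_cost_per_paper / sakana_cost_per_paper
budget_gap = serious_scaffold_budget / typical_inference_spend

# Pretraining compute growing ~10x/year compounds quickly over 3 years.
compute_growth_3y = 10 ** 3

print(f"human vs. automated cost per paper: ~{cost_gap:,.0f}x")                    # ~6,667x
print(f"serious scaffold vs. typical inference spend: ~{budget_gap:.0f}x")          # ~100x
print(f"pretraining compute growth over 3 years at 10x/yr: ~{compute_growth_3y:,}x")  # ~1,000x
```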
 

This seems pretty bad both w.r.t. underestimating the probability of shorter timelines and faster takeoffs, and in more specific ways too. E.g. we could be underestimating the risks of open-weights Llama-3 (or soon Llama-4) by a lot, given all the potential under-elicitation.

Thomas Kwa's Shortform
Bogdan Ionut Cirstea, 16d

I would love to see an AI safety R&D category. 

My intuition is that quite a few crucial AI safety R&D tasks are probably much shorter-horizon than AI capabilities R&D, which should be very helpful for automating AI safety R&D relatively early. E.g. the compute and engineer-hours spent on pretraining (where most capabilities [still] seem to be coming from) are a few OOMs larger than those spent on fine-tuning (where most intent-alignment seems to be coming from).

METR's Observations of Reward Hacking in Recent Frontier Models
Bogdan Ionut Cirstea, 22d

Seems pretty minor for now though:


The actual cheating behavior METR has observed seems relatively benign (if annoying). While it’s possible to construct situations where this behavior could cause substantial harm, they’re rather contrived. That’s because the model reward hacks in straightforward ways that are easy to detect. When the code it writes doesn’t work, it’s generally in a way that’s easy to notice by glancing at the code or even interacting with the program. Moreover, the agent spells out the strategies it’s using in its output and is very transparent about what methods it’s using.

Inasmuch as reward hacking is occurring, we think it’s good that the reward hacking is very obvious: the agents accurately describe their reward hacking behavior in the transcript and in their CoT, and the reward hacking strategies they use typically cause the programs they write to fail in obvious ways, not subtle ones. That makes it an easier-to-spot harbinger of misalignment and makes it less likely the reward hacking behavior (and perhaps even other related kinds of misalignment) causes major problems in deployment that aren’t noticed and addressed.

Does the Universal Geometry of Embeddings paper have big implications for interpretability?
Answer by Bogdan Ionut Cirstea, May 27, 2025

Yes, I do think this should be a big deal, and even more so for monitoring (than for understanding model internals). It should also have been at least somewhat predictable, based on theoretical results like those in I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? and in All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling.

A "Bitter Lesson" Approach to Aligning AGI and ASI
Bogdan Ionut Cirstea, 2mo

I suspect you'd be interested in this paper, which seems to me like a great proof of concept: Safety Pretraining: Toward the Next Generation of Safe AI.

Bogdan Ionut Cirstea's Shortform
Bogdan Ionut Cirstea, 3mo

In light of recent work on automating alignment and on AI task horizons, I'm (re)linking this brief presentation of mine from last year, which I think stands up pretty well and might have gotten fewer views than ideal:

Towards automated AI safety research

Bogdan Ionut Cirstea's Shortform
Bogdan Ionut Cirstea, 4mo

The first automatically produced, (human) peer-reviewed, (ICLR) workshop-accepted[/able] AI research paper: https://sakana.ai/ai-scientist-first-publication/

Bogdan Ionut Cirstea's Shortform
Bogdan Ionut Cirstea, 4mo

There have been numerous scandals within the EA community about how working for top AGI labs might be harmful. So, when are we going to have this conversation: contributing in any way to the current US admin getting (especially exclusive) access to AGI might be (very) harmful?

[cross-posted from X]

Detecting Strategic Deception Using Linear Probes
Bogdan Ionut Cirstea, 5mo

I find the pessimistic interpretation of the results a bit odd given considerations like those in https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed. 

Daniel Kokotajlo's Shortform
Bogdan Ionut Cirstea, 5mo

I also think it's important to notice how much less scary, and how much more probably easy to mitigate (at least strictly when it comes to technical alignment), this story seems compared to the scenarios from ~10 years ago, e.g. from Superintelligence / from before LLMs, when pure RL seemed like the dominant paradigm for getting to AGI.

More posts

Densing Law of LLMs (7mo, 2 comments)
LLMs Do Not Think Step-by-step In Implicit Reasoning (7mo, 0 comments)
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? (7mo, 0 comments)
Disentangling Representations through Multi-task Learning (7mo, 1 comment)
Reward Bases: A simple mechanism for adaptive acquisition of multiple reward types (7mo, 0 comments)
A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers (7mo, 0 comments)
The Computational Complexity of Circuit Discovery for Inner Interpretability (8mo, 2 comments)
Thinking LLMs: General Instruction Following with Thought Generation (9mo, 0 comments)
Instruction Following without Instruction Tuning (9mo, 0 comments)
Validating / finding alignment-relevant concepts using neural data (9mo, 0 comments)