Ariel_ — LessWrong

LESSWRONG
is fundraising!
LW

Ariel_ — LessWrong

Alignment remains a hard, unsolved problem

The "usefulness" reported in papers is how much work can be gotten out of an aligned model.

Hmm, my impression was that this was just a design choice to make it easier to do to iterate on setups with small modes, and that it would be changed later on. In the Ctrl-z paper, they use the "untrusted model" with honest prompting -

We normalize our usefulness numbers such that gets 100% usefulness and T gets 0% [ $U_{H}$ is GPT-4o mini acting honestly]

In practice 4o-mini is ~aligned/trusted, but I imagine that future papers would need to improve on this methodology.

Though its a good question how to do this, e.g. maybe we need to set up some sort of blind test with a tampered model similar to the Hidden Objectives paper, do elicitations somehow to reach close to the "actual" $U_{H}$ which would be the untampered model. But that might not generalize to more capable models (?)

Would be curious for links to Redwoods justifications if you have them

Plans A, B, C, and D for misalignment risk

Ariel_1mo10

Any good posts/papers discussing "handover"? e.g. the handover of AI research to AI R&D agents (the plan of the original OpenAI Superalignment team). I'm also interested in any adjacent research agendas which might help the handover succeed.

Some of the more relevant work i've read (other than this post) are Wentworth's slop post, various scalable oversight/safety case papers, automation collapse.

Scheming Toy Environment: "Incompetent Client"

Ariel_3mo10

Thanks! Tbh, I never would have posted it if not for encouragement from a friend

Linda Linsefors's Shortform

Ariel_5mo10

Yeah, I agree premotrem is not super commonly used. Not sure where I learned it, maybe an org design course. I mainly gave that as an example of over-eagerness to name existing things - perhaps there aren't that many examples which are as clear cut, maybe in many of them the new term is actually subtly different from the existing term.

But I would guess that a quick Google search could have found the "premortem" term, and reduced one piece of jargon.

Linda Linsefors's Shortform

Ariel_5mo10

In my experience, people saying they "updated" did not literally change a percentage or propagate a specific fact through their model. Maybe it's unrealistic to expect it to be so granular, but to me it devalues the phrase and so I try to avoid it unless I can point to a somewhat specific change in my model. Whereas usually my model (e.g. of a subset of AI risks) is not really detailed enough to actually perform a Bayesian update, but more to just generally change my mind or learn something new and maybe gradually/subconsciously rethink my position.

Maybe I have too high bar for what counts as a bayesian updates - not sure? But if not, then I think "I updated" would count more often as social signaling or as appropriation of a technical term to a non-technical usage. Which is fine, but seems less than ideal for LW/AI Safety people.

So I would say that jargon has this problem (of being used too casually/technically imprecise) sometimes, even if I agree with your inferential distance point.

As far as LW jargon being interchangeable with existing language - one case I can think of is "murphyjitsu", which is basically exactly a "premortem" (existing term) - so maybe there's a bit of over-eagerness to invent a new term instead of looking for an existing one.

Announcing Trajectory Labs - A Toronto AI Safety Office

Ariel_7mo51

Congrats! Can confirm that it is a great office :)

Orienting Toward Wizard Power

Ariel_7mo30

my dream dwelling was a warehouse filled with whatever equipment one could possibly need to make things and run experiments in a dozen different domains

I had a similar dream, though I mostly thought about it from the context of "building cool fun mechanical stuff" and working on cars/welding bike frames. I think the actual usefulness might be a bit overrated, but still would be fun to have.

I do have a 3D printer though, and a welder (though I don't have anywhere to use it - needs high voltage plug). Again though not sure how useful these things are - it seems to me like it is mostly for fun, and in the end the novelty wears off a bit once I realize that building something actually useful will take way more time than I want to spend on non-AI safety work.

But maybe that is something I shouldn't have given up on that quickly, perhaps it is a bit of "magic" that makes life fun and maybe even a few actually cool inventions could come from this kind of tinkering. And maybe that would also permeate into how I approach my AI safety work.

The EU Is Asking for Feedback on Frontier AI Regulation (Open to Global Experts)—This Post Breaks Down What’s at Stake for AI Safety

Ariel_8mo20

Sure! and yeah regarding edits - I have not gone through the full request for feedback yet, I expect to have a better sense late next week of which contributions are most needed and how to prioritize. I mainly wanted to comment first on obvious things that stood out to me from the post.

There is also an Evals workshop in Brussels on Monday where we might learn more. I've know of some some non-EU based technical safety researchers who are attending, which is great to see.

The EU Is Asking for Feedback on Frontier AI Regulation (Open to Global Experts)—This Post Breaks Down What’s at Stake for AI Safety

Ariel_8mo30

I'd suggest updating the language in the post to clarify things and not overstate :)

Regarding the 3rd draft - opinions varied between people I work with but we are generally happy. Loss of Control is included in the selected systemic risks, as well as CBRN. Appendix 1.2 also has useful things, though some valid concerns got raised there on compatibility with the AI Act language that still need tweaking (possobly merging parts of 1.2 into selected systemic risks). As far as interpretability - the code is meant to be outcome based, and the main reason evals are mentioned is that they are in the act. Prescribing interpretability isn't something the code can do, and also probably shouldn't as these techniques arent good enough yet to be prescribed as mandatory for mitigating systemic risks.

The EU Is Asking for Feedback on Frontier AI Regulation (Open to Global Experts)—This Post Breaks Down What’s at Stake for AI Safety

Ariel_8mo910

FYI I wouldn't say at all that AI safety is under-represented in the EU (if anything, it would be easier to argue the opposite). Many safety orgs (including mine) supported the Codes of Practice, and almost all the Chairs and vice chairs are respected governance researchers. But probably still good for people to give feedback, just don't want to give the impression that this is neglected.

Also no public mention of intention to sign the code has been made as far as I know. Though apart from copyright section, most of it is in line with RSPs, which makes signing more reasonable.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments