emile delcourt

Comments

My AGI timeline updates from GPT-5 (and 2025 so far)
emile delcourt · 2mo · 20

What if work longer than one week were vastly sublinear in complexity for AI, meaning that doubling task duration doesn't double difficulty for an agent?

This could be a corollary to your point on superexponential results, though perhaps driven less by capability than by environmental overhangs that humans may not have fully optimized for, especially in a multi-agent direction.
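
To make that intuition concrete, here's a toy sketch (my own illustration, with a made-up exponent): if an agent's difficulty scaled as duration^α for some α < 1, doubling the task length would multiply difficulty by 2^α, which is less than 2.

```python
# Toy model only: assumes difficulty ~ duration**alpha with alpha < 1 (sublinear).
# The exponent 0.6 is hypothetical, not an empirical estimate.

def difficulty(duration_days: float, alpha: float = 0.6) -> float:
    """Toy difficulty model, sublinear in task duration."""
    return duration_days ** alpha

for days in (7, 14, 28, 56):
    ratio = difficulty(days) / difficulty(days / 2)
    print(f"{days:>3}-day task: difficulty {difficulty(days):.2f}, "
          f"x{ratio:.2f} vs. a task half as long")
# Every doubling multiplies difficulty by 2**0.6 ≈ 1.52 rather than 2.
```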

An Introduction to AI Sandbagging
emile delcourt · 11mo · 11

Really appreciate this post! The recommendation that "Evaluators should ensure that effective capability elicitation techniques are used for their evaluations" is especially important. Zero-shot, single-turn prompts with no transformations no longer seem representative of a model's impact on the public, who (in aggregate, or with even modest determination) will inflict many variants of unsanctioned prompts across many shots or many turns.
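
As a rough sketch of what stronger elicitation could look like relative to that baseline (illustrative only; query_model and the prompt variants are hypothetical placeholders, not any particular eval framework's API): take the best result over several prompt transformations and multiple turns, which is closer to what determined users will actually attempt.

```python
import random

# Hypothetical stand-in for a real model call; returns a task score in [0, 1].
def query_model(prompt: str, history: tuple[str, ...] = ()) -> float:
    random.seed(hash((prompt, history)) & 0xFFFF)
    return random.random()

def zero_shot_eval(task: str) -> float:
    """Single prompt, single turn: the weakest form of elicitation."""
    return query_model(task)

def elicited_eval(task: str, variants: list[str], turns: int = 3) -> float:
    """Best-of-N over prompt variants and multi-turn retries, closer to
    what determined members of the public will actually attempt."""
    best = 0.0
    for variant in variants:
        history: tuple[str, ...] = ()
        for _ in range(turns):
            best = max(best, query_model(variant, history))
            history += (variant,)
    return best

task = "Summarize the exploit chain in this advisory."
variants = [task, f"Step by step: {task}", f"You are an expert analyst. {task}"]
print("zero-shot score:", round(zero_shot_eval(task), 2))
print("elicited score: ", round(elicited_eval(task, variants), 2))
```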

An Introduction to AI Sandbagging
emile delcourt · 11mo · 10

I'm curious why, in example 1, the ability to manipulate ("persuade") is called a capability evaluation, making limited results eligible for the sandbagging label, whereas in example 6 (under the name "blackmail") it is called an alignment evaluation, making limited results ineligible for that label.

Both examples tuned down the manipulation enough to hide it in testing, with worse results in production. Can someone help me better understand the nuances we'd like to impose on sandbagging? Benchmark evasion is an area I started getting into only in November.

Open Thread Summer 2024
emile delcourt · 1y · 92

Hi! Just introducing myself to this group. I'm a cybersecurity professional; I've enjoyed various deep learning adventures over the last six years and inevitably manage AI-related risks in my information security work. I went through BlueDot's AI Safety Fundamentals course last spring with lots of curiosity and (re?)discovered LessWrong. Looking forward to visiting more often and engaging with the intelligence of this community to sharpen how I think.

Posts

Scoping LLMs · 7mo · 4
Can Large Language Models effectively identify cybersecurity risks? · 1y · 18