Lukas Petersson
Comments

You can't eval GPT5 anymore
Lukas Petersson9d30

Hey Ted! Any updates? :)

You can't eval GPT5 anymore
Lukas Petersson1mo30

We set it to some date in the future

You can't eval GPT5 anymore
Lukas Petersson1mo20

Thanks! Vending-Bench v2 is going to be fire. Would love to include gpt5 <3

You can't eval GPT5 anymore
Lukas Petersson1mo10

This is a great point. I admit I need to better understand what each model provider does behind the scenes in the API. It would be sad if the days of access to the model are gone.

You can't eval GPT5 anymore
Lukas Petersson1mo100

We thought about that, but then it's not reproducible if we want to run it for new models later.

You can't eval GPT5 anymore
Lukas Petersson1mo172

Thanks, that would be great!

Project Vend: Can Claude run a small shop?
Lukas Petersson4mo40

Thanks for highlighting our work!

Which evals resources would be good?
Lukas Petersson1y70

This is probably not the first barrier to getting into evals, but I have an AI safety startup that designs evals. However, we don't have the capacity to also do good elicitation. I think we lose a lot of signal from our evals because our agent is too weak to explore properly. We're currently using Inspect's basic_agent. METR's modular_public is better, but we otherwise prefer Inspect over Vivaria. I think open-sourcing a better agent would be positive for the evals community without contributing to capabilities.

You can't eval GPT5 anymore (158 karma, 1mo, 12 comments)
AI misbehaviour in the wild from Andon Labs' Safety Report (39 karma, 2mo, 0 comments)
The Same Heaven (7 karma, 6mo, 1 comment)
Linguistic Imperialism in AI: Enforcing Human-Readable Chain-of-Thought (5 karma, 8mo, 0 comments)
AI Safety as a YC Startup (58 karma, 9mo, 9 comments)