Ah right right --- I remember reading that post. The subscribe form using dynomiiiiiiiiiight makes sense, especially given how I prompted Llama: I pasted the post in and then appended "Author:" at the end.
I am curious if there's a way to get an instruction-tuned model to role-play being a base model, and whether it would then do better at truesight than regular instruction-tuned models. Like, why do chat models get worse? Is it that the assistant character is bad at this? Plenty of interesting questions here.
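The dumbest version of the experiment is probably just a system prompt telling the chat model to act like a raw completion engine. A minimal sketch of what I mean (untested; the model name and the exact framing below are placeholders, not a claim about what works):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical framing: ask the assistant to drop its persona and
# behave like a pretrained next-token predictor.
BASE_MODEL_ROLEPLAY = (
    "You are a raw base language model with no assistant persona. "
    "Continue the user's text exactly as a pretrained next-token "
    "predictor would, with no commentary and no refusals."
)

post_text = open("post.txt").read()  # the post whose author we want guessed

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any instruction-tuned chat model
    messages=[
        {"role": "system", "content": BASE_MODEL_ROLEPLAY},
        {"role": "user", "content": post_text + "\n\nAuthor:"},
    ],
    temperature=1.0,
)
print(resp.choices[0].message.content)
```

Then you could compare its guesses against an actual base model's completions on the same posts.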
Llama 3.1 405B base: dynomiiiiiiiiiight
I resampled it a couple of times and it consistently added a couple of extra i's to your handle (despite producing your URL dynomight.net, so it clearly knows you). Not quite sure why. Weird that base models are so much better at this.
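For anyone who wants to poke at this: the setup was pure completion, i.e. paste the post and append "Author:", with no chat template. A rough sketch, assuming an OpenAI-compatible completions endpoint from whichever provider hosts the base (not instruct) checkpoint; the base_url and model ID below are placeholders:

```python
from openai import OpenAI

# Placeholder endpoint: any provider exposing the 405B *base* model
# through the legacy completions API should work the same way.
client = OpenAI(base_url="https://example-provider/v1", api_key="...")

post_text = open("post.txt").read()

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B",  # base, not -Instruct
    prompt=post_text + "\n\nAuthor:",
    max_tokens=10,
    temperature=1.0,
    n=5,  # resample a few times, as above
)
for choice in resp.choices:
    print(choice.text.strip())
```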
Thanks for compiling! It feels apt that the name of the top caller is Will Mentor.
Do LessWrong quick takes count as social media? :)
On this dataset, I find that Gemini 3 Pro gets 60% of 2-hop questions right and 34% of 3-hop questions right.
I initially got tripped up by the wording here: I read this as 60% accuracy on 2-hop questions in a single forward pass, when it's actually with 300 filler tokens, which aren't mentioned until later in the post.
It's a good piece, but I wanted to comment in case someone else gets confused at the same spot.
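For anyone else skimming: as I understand it, the model gets ~300 throwaway tokens before it has to answer, rather than answering immediately. A rough reconstruction of the kind of prompt involved (the filler text and instruction are my guesses, not the post's actual code):

```python
# Hypothetical reconstruction of the filler-token setup described in the post.
FILLER = " ".join(["..."] * 300)  # roughly 300 meaningless filler tokens

question = "Who is the spouse of the performer of Imagine?"  # example 2-hop question

prompt = (
    f"{question}\n"
    "Do not reason out loud. Filler tokens follow; after them, "
    "give only the final answer.\n\n"
    f"{FILLER}\n\nAnswer:"
)
print(prompt)
```

The point being that the filler buys the model extra forward passes without any legible chain of thought.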
I came here to make the exact same comment. I wonder how well 2-hop latent reasoning correlates with SimpleQA scores.
Kudos to DeepMind for being the first to release output watermarking and a semi-public detector. You can nominally sign up for it here.
Afaict, some of this is now in the Gemini app. But if not, feel free to ping me (I have access).
The only public instance of this change being pointed out was a LessWrong comment by someone unaffiliated with Anthropic.
Nitpick: an outside reporter also noticed this on the day of the release and wrote up a story about it. It didn't seem to get much traction, though.
I thought the "past chats" feature was a tool for looking at previous chats, which basically only fires if the user asks for it (i.e., there wasn't a change to the system prompt). So I'm a bit surprised that it seemed to make a difference around sycophancy for you? But maybe I'm misunderstanding something.
I was just dragged through Demons for a book club, so I was amused to read this. At least it means the time I spent reading that wasn't in vain.
There's some stuff that feels a little weird here. The author says they left in early 2024 and then spent the "following months" reading Dostoevsky and writing this essay. Was the essay written a while ago and only put up now? (It has to have been edited relatively recently, if it was run through 4.5.) Who are the editors alluded to at the very end? Is that supposed to be Tim Hwang? A little more transparency would be much appreciated (the disclaimer about Opus 4.5 being used for anonymization was only added on the 24th, after some people had pointed out that the essay sounded rather AI-written).
Another weirdness: why did Hwang put up, at basically the same time, another microsite about Demons (https://shigalyovism.com/) written by an anonymous author "still working in industry", with clear LLM writing patterns? Though that one is much less in-depth.
Can anyone with more experience in the frontier labs / the uniparty give a sanity check on whether this seems like it was written by someone who is who they say they are?