Good post, thanks. I especially agree with the points about current research teaching AIs synthetic facts and about following through with deals made during research.
It seems very likely that we will have to coordinate with AIs in the future, and we want maximum credibility when we do. If we aim to advance capabilities for instilling false beliefs in AIs, which is promising from a safety-research perspective, we should be very clear about when and how this manipulation is done, so that it does not undermine our credibility. Developing such capabilities further also makes the currently most promising pathway for honoring our commitments, fine-tuning the contract into the model, more uncertain from the model's perspective.
Committing to even superficial deals early and often is a strong signal of credibility, and we want to accumulate as much of this kind of evidence as possible. This matters for human psychology as well. As mentioned, dealmaking with AIs is currently a fringe view societally, and if even the most serious safety researchers have set no precedent for it, it becomes a much larger step for the safety community to bargain for large amounts of human-held resources if push comes to shove at some point.