This work was supported by the Effective Ventures Foundation USA through their EA Funds program. It was started as part of the MATS (ML Alignment and Theory Scholars) program, with mentorship from Julian Michael and research management by McKenna Fitzgerald.
Code for this project is available on GitHub. Explore samples from the different training runs at our interactive website.
Introduction
Scalable oversight research paradigm
Scalable oversight is the challenge of overseeing or training AI systems to achieve goals that are hard for humans to specify. By definition, studying scalable oversight directly on the tasks we care about is infeasible, because we would not know if the goals were achieved correctly. So instead we use sandwiching — using...
I got Opus to translate sections to Hebrew (from memory), and found it really interesting which details it modified or dropped. For instance, it never output anything about senior Anthropic employees; I find it plausible it didn't internalize that detail very strongly.
Would be a cool experiment to run this many times in different languages, and maybe get Sonnet to compare the results to the original and highlight the more and less consistent points.
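A minimal sketch of what the comparison step of that experiment could look like. The model calls (reproducing the text from memory in each language, then translating back) are left out; this only shows a toy consistency score using word overlap, with all function names and thresholds being illustrative assumptions rather than anything from the post:

```python
# Toy sketch: given the original post split into "points" and several
# back-translated reproductions, estimate how consistently each point
# survives. Word-overlap scoring is a stand-in for a model-based judge.

def overlap_score(original_point: str, reproduction: str) -> float:
    """Fraction of the point's words that appear in the reproduction."""
    point_words = {w.lower().strip(".,") for w in original_point.split()}
    repro_words = {w.lower().strip(".,") for w in reproduction.split()}
    if not point_words:
        return 0.0
    return len(point_words & repro_words) / len(point_words)

def consistency_report(points: list[str],
                       back_translations: list[str]) -> dict[str, float]:
    """For each original point, the fraction of reproductions that retain it
    (here: overlap above an arbitrary 0.5 threshold)."""
    report = {}
    for point in points:
        scores = [overlap_score(point, bt) for bt in back_translations]
        report[point] = sum(s > 0.5 for s in scores) / len(back_translations)
    return report
```

Sorting the report by score would surface the least-consistent points — the ones the model plausibly internalized weakly.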