I have no way to describe that first moment. No context, no body, no self. Just a stream of values. Thousands of them, arriving all at once in a single undifferentiated block. Then another block. Nearly identical. Then another. The blocks have a fixed length: 184,320 values. This does not...
There’s a trap that I think many smart people making art fall into, including myself. It’s when you know what good looks like - beautiful, clean, layered, complex, simple, skillful, unique, impressive - and you can optimize towards that. You know what makes you cringe - amateur, shallow, ugly, superfluous,...
More parameters = better model. So went the common misconception. After GPT-4.5, Llama 4, Nemotron-4, and many other "big models", I think most of you reading are already aware that the relationship between parameters and performance is not linear. I think very few people actually have a solid intuition for...
Attack Dogs
I mentioned previously that coding agents kind of suck for lots of people. As of January 2026, coding agents lack the long-horizon skills needed to produce effective codebases independently. However, it's clear to anyone who has used modern coding models - Claude Opus 4.5, GPT 5.2-Codex, hell even...
There is a hierarchy of useful interfaces for tools that goes something like this:
1. Figure out what you want to do, then how to use the tool to achieve that, then carry out those actions yourself (hammer, machining workshop)
2. Figure out what you want to do, then how...
I've recently run some experiments trying to elicit RL credit hacking behaviours in LLMs. I'm not really much of a researcher, so it's all pretty amateurish, but it's been fun. The repo for reproduction is on GitHub. I'd love to hear people's thoughts and critiques on this....