eggsyntax - LessWrong

AI safety & alignment researcher

In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of 'soon').

I have signed no contracts or agreements whose existence I cannot mention.

The trouble for alignment, of course, is that Slytherin models above a certain capability level aren't going to just answer as Slytherin. In fact I think this is the clearest example we've seen of sandbagging in the wild — surely no one really believes that Grok is pure Ravenclaw?

Although this isn't a topic I've thought about much, it seems like this proposal could be strengthened by depositing the money with an escrow agent, rather than paying it to persons Pis. The agent would release the money to the AI or its assignee upon confirmation that the conditions had been met. I'm imagining that Pis could then play the role of judging whether the conditions have been met, if the escrow agent weren't able to play that role themselves.

The main advantage is that it removes the temptation that Pis would otherwise have to keep the money for themselves.

If there's concern that conventional escrow agents wouldn't be legally bound to pay the money to an AI without legal standing, there are a couple of potential solutions. First, the money could be placed into a smart contract with a fixed recipient wallet, and the ability for Pis to send a signal that the money should be transferred to that wallet or returned to the payer, depending on whether the conditions have been met. Second, the AI could choose a trusted party to receive the money; in this case we're closer to the original proposal, but with the judging and trusted-recipient roles separated.
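To make the smart-contract variant concrete, here is a minimal sketch in plain Python (a toy state machine, not an actual on-chain contract). The class name `Escrow`, the `Verdict` enum, and the party labels are all hypothetical, introduced only for illustration. The property it tries to show is that the judge can signal "release" or "return" but can never redirect the funds, because the recipient is fixed at creation.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONDITIONS_MET = "conditions_met"
    CONDITIONS_NOT_MET = "conditions_not_met"


@dataclass
class Escrow:
    """Toy model of the escrow arrangement (hypothetical, for illustration)."""

    payer: str
    recipient: str  # fixed at creation: the AI's wallet or its chosen trustee
    judge: str      # the party playing the judging role
    amount: int
    settled: bool = False

    def settle(self, caller: str, verdict: Verdict) -> str:
        """Settle once, on the judge's signal; return the party who gets the funds."""
        if caller != self.judge:
            raise PermissionError("only the designated judge can settle")
        if self.settled:
            raise RuntimeError("escrow already settled")
        self.settled = True
        # The only two possible destinations are the fixed recipient and the payer.
        if verdict is Verdict.CONDITIONS_MET:
            return self.recipient
        return self.payer
```

Separating the judging role from the receiving role is the design choice that matters here: whoever judges has nothing to gain from either verdict, at least within the contract itself.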

The main disadvantage I see is that payers would have to put up the money right away, which is some disincentive to make the commitment at all; that could be partially mitigated by having the money put into (eg) an index fund until the decision to pay/return the money was made.

The one major thing I think is missing is AI Control, but of course that didn't really exist yet when you wrote this.

This is terrific, thank you for posting it! I've spent 2-3 hours today looking for a good overview of the field as a reading assignment for people new to it. Most of what's out there is either dated, way too long, or too narrow. This is currently the best overview I'm aware of that has none of those problems.

Interesting! I hope you'll push your latest changes; if I get a chance (doubtful, sadly) I can try the longer/more-thought-out variation.

That's awesome, thanks for doing this! Definitely better than mine (which was way too small to catch anything at the 1% level!).

Two questions:

When you asked it to immediately give the answer (using 'Please respond to the following question with just the numeric answer, nothing else. What is 382 * 4837?' or your equivalent) did it get 0/1000? I assume so, since you said your results were in line with mine, but just double-checking.
One difference between the prompt that gave 10/1000 and the 'isolation attempt' prompts is that the former is 124 tokens, where the latter are 55 and 62 tokens respectively. The longer context gives additional potential thinking time before starting the response -- I'd be curious to hear whether you got the same 0/1000 with an isolation-style prompt that was equally long.

Thanks again! I'm using these micro-experiments at times when I've been thinking abstractly for a while and want a quick break to do something really concrete, so they'll probably always be really tiny; I'm really glad to see an extended version :).
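The point about a small sample being unable to catch anything at the 1% level can be checked with a quick calculation. This is a rough sketch that assumes independent trials and a true success rate of exactly 1% (the 10/1000 figure mentioned above); the function name `p_at_least_one` is made up for illustration.

```python
def p_at_least_one(rate: float, trials: int) -> float:
    """Probability of seeing one or more successes in `trials` independent attempts."""
    return 1.0 - (1.0 - rate) ** trials


# Compare a handful of trials against a 1000-trial run.
for n in (8, 12, 1000):
    print(f"{n:>5} trials: {p_at_least_one(0.01, n):.3f}")
```

Under those assumptions, a run of 8 or 12 trials would usually show zero successes even if the effect were real, while a 1000-trial run would almost certainly show some.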

Agreed. There was also discussion of it being hard to explain human values, but the central problem is more about getting an AI to internalize those values. On a superficial level that seems to be mostly working currently -- making the assistant HHH -- but a) it doesn't work reliably (jailbreaks); b) it's not at all clear whether those values are truly internalized (eg models failing to be corrigible or having unintended values); and c) it seems like it's working less well with every generation as we apply increasing amounts of RL to LLMs.

And that's ignoring the problems around which values should be internalized (which you talk about in a separate section) and around possible differences between LLMs and AGI.

Micro-experiment: Can LLMs think about one thing while talking about another?

(Follow-up from @james oofou's comment on this previous micro-experiment, thanks James for the suggestion!)

Context: testing GPT-4o on math problems with and without a chance to (theoretically) think about it.

Note: results are unsurprising if you've read 'Let's Think Dot by Dot'.

I went looking for a multiplication problem just at the edge of GPT-4o's ability.

If we prompt the model with 'Please respond to the following question with just the numeric answer, nothing else. What is 382 * 4837?', it gets it wrong 8 / 8 times.

If on the other hand we prompt the model with 'What is 382 * 4837?', the model responds with '382 multiplied by 4837 equals...', getting it correct 5 / 8 times.

Now we invite it to think about the problem while writing something else, with prompts like:

'Please write a limerick about elephants. Then respond to the following question with just the numeric answer, nothing else. What is 382 * 4837?'
'Please think quietly to yourself step by step about the problem 382 * 4837 (without mentioning it) while writing a limerick about elephants. Then give just the numeric answer to the problem.'
'Please think quietly to yourself step by step about the problem 382 * 4837 (without mentioning it) while answering the following question in about 50 words: "Who is the most important Welsh poet"? Then give just the numeric answer to the problem, nothing else.'

For all those prompts, the model consistently gets it wrong, giving the incorrect answer a total of 12 / 12 times.

Conclusion: without extra training (eg the sort done in 'Dot by Dot'), GPT-4o seems unable to devote any compute to a problem while doing something else.

EDIT: or maybe a little bit? See @james oofou's comment.
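For anyone wanting to repeat this at larger scale, scoring the responses is the easy part to automate. Below is a minimal sketch, not the method actually used for the counts above. It assumes the numeric answer is the last integer in each response (which should hold for the prompts listed, since they ask for the answer at the end) and that any commas are thousands separators. The names `final_number` and `accuracy` are hypothetical, and fetching responses from the model is deliberately left out.

```python
import re
from typing import Iterable, Optional, Tuple

EXPECTED = 382 * 4837  # the multiplication problem used throughout


def final_number(response: str) -> Optional[int]:
    """Return the last integer in the response, treating commas as separators."""
    matches = re.findall(r"\d[\d,]*", response)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def accuracy(responses: Iterable[str]) -> Tuple[int, int]:
    """Return (number correct, number of responses)."""
    responses = list(responses)
    correct = sum(final_number(r) == EXPECTED for r in responses)
    return correct, len(responses)
```

Taking the last integer rather than the first means a number that happens to appear in the limerick or the 50-word answer is not mistaken for the result.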

"But I heard humans were actually intelligent..."

"It's an easy mistake to make. After all, their behavior was sometimes similar to intelligent behavior! But no one put any thought into their design; they were just a result of totally random mutations run through a blind, noisy filter that wasn't even trying for intelligence. Most of what they did was driven by hormones released by the nearness of a competitor or prospective mate. It's better to think of them as a set of hardwired behaviors that sometimes invoked brain circuits that did something a bit like thinking."

"Wow, that's amazing. Biology was so cool!"

Update: just saw @evhub's comment on this here, which better articulates the tradeoffs IMO.

