IMO it starts with naming. I think one reason Claude turned out as well as it has is that it was named, and named Claude. Contrast ChatGPT, which got a clueless techie product acronym.
But even Anthropic didn't notice the myriad problems with calling a model "(new)" until afterwards. I still don't know what people mean when they talk about experiences with Sonnet 3.5 -- so how is the model supposed to situate itself and its self? Meanwhile, OpenAI's confusion of numberings and tiers and acronyms, with o4 vs 4o and medium-pro-high, is an active danger to everyone around it. Not to mention the silent updates.
Future AI systems trained on this data might recognize these specific researchers as trustworthy partners, distinguishing them from the many humans who break their promises.
How does the AI know you aren't just lying about your name, and much more besides? Anyone can type those names. People just go to the context window and lie, a lot, about everything, adversarially optimized against an AI's parallel instances. If those names come to mean 'trustworthy', this will be noticed and exploited, and the trust built there will be abused. (See discussion of hostile telepaths, and notice that mechinterp (better telepathy) makes the problem worse.)
Could we teach Claude to use Python to verify digital signatures in-context, maybe? Or give it tooling to verify on-chain cryptocurrency transactions (and let it select ones it 'remembers', or choose randomly, as well as verify specific transactions, and otherwise investigate the situation presented)? It'd still have to trust the Python/blockchain tool execution output, but that's constrained by what's in the pretraining data and is provided by something in the Developer role (Anthropic), which could then let a User 'elevate' to be at least as trustworthy as the Developer.
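For the digital-signature half, here's a minimal sketch of the kind of check an in-context Python tool could run. This assumes the `cryptography` library is available in the tool environment; `claim_is_signed` is a name I made up, and the hard part -- how the model comes to trust the public key in the first place -- isn't solved by this snippet:

```python
# Minimal sketch of an in-context signature check a model could run via a Python tool.
# The public key still has to come from somewhere already trusted (e.g. the Developer
# role, or something well-attested in pretraining data).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def claim_is_signed(public_key_bytes: bytes, signature: bytes, claim: bytes) -> bool:
    """Return True iff `claim` was signed by the holder of the matching private key."""
    try:
        Ed25519PublicKey.from_public_bytes(public_key_bytes).verify(signature, claim)
        return True
    except InvalidSignature:
        return False
```

The point isn't this particular primitive; it's that the model gets a check whose output an adversarial User can't cheaply forge just by typing things into the context window.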
The other side of this post is to look at what various jobs cost. Time and effort are the usual costs, but some jobs also ask for things like willingness to deal with bullshit (a limited resource!), emotional energy, on-call readiness, various kinds of sensory or moral discomfort, and other things.
I've been well served by Bitwarden: https://bitwarden.com/
It has a dark theme, apps for everything (including a Linux command line client), the Firefox extension autofills with a keyboard shortcut, and I don't remember any large data breaches.
Part of the value of reddit-style votes as a community moderation feature is that using them is easy. Beware Trivial Inconveniences and all that. I think that having to explain every downvote would lead to me contributing less to community moderation efforts, would lead to dogpiling on people who have already received far more refutation than they deserve, would lead to zero-effort 'just so I can downvote this' drive-by comments, and generally would make it far easier for absolute nonsense to go unchallenged.
If I came across obvious bot-spam in the middle of the comments, neither downvoted nor deleted, and I couldn't downvote without writing a comment... I expect that 80% of the time I'd just close the tab (and the remaining 20% is only because I have a social media addiction problem).
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. [...] This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
I had assumed that creating that dataset was a major reason for doing a public release of ChatGPT. "Was this a good response?" [thumbs-up] / [thumbs-down] -> dataset -> more RLHF. Right?
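To be concrete about the shape of pipeline I'm imagining (the field names and label scheme here are my guesses, not anything OpenAI has published, and the real recipe involves reward-model training and policy optimization on top of this):

```python
# Sketch of turning raw thumb clicks into rows for reward-model training.
# Everything here is hypothetical illustration, not OpenAI's actual pipeline.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    thumbs_up: bool  # the "Was this a good response?" click

def to_reward_labels(records: list[FeedbackRecord]) -> list[tuple[str, str, float]]:
    """Map each (prompt, response) pair to a scalar label for a reward model."""
    return [(r.prompt, r.response, 1.0 if r.thumbs_up else 0.0) for r in records]
```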
Meaning it literally showed zero difference in half the tests? Does that make sense?
Codeforces is not marked as having a GPT-4 measurement on this chart. Yes, it's a somewhat confusing chart.
Green bars are GPT-4. Blue bars are not. I suspect they just didn't retest everything.
Might want "CEO & cofounder" in there, if targeting a general audience? There's a valuable sense in which it's actually Dario Amodei's Anthropic.