Recent Discussion

There's a kind of game here on Less Wrong.

It's the kind of game that's a little rude to point out. Part of how it works is by not being named.

Or rather, attempts to name it get dissected so everyone can agree to continue ignoring the fact that it's a game.

So I'm going to do the rude thing. But I mean to do so gently. It's not my intention to end the game. I really do respect the right of folks to keep playing it if they want.

Instead I want to offer an exit to those who would really, really like one.

I know I really super would have liked that back in 2015 & 2016. That was the peak of my hell in rationalist circles.

I'm watching the game...

I'm not saying it's bad to do these things.

I'm saying that if you're doing them as a distraction from inner pain, you're basically drunk.

How is this falsifiable?

Can you point to five people who have done this, but still have a different orientation from you?

Should society eliminate schools? Should we have more compulsory schooling? Should you send your kids to school? Should you prefer to hire job candidates who have received more schooling, beyond school's correlation with the g factor? Should we consider the spread of education requirements to be a form of class war by the better-educated against the worse-educated which must be opposed for the sake of the worse-educated and the future of society?

tailcalled:
How does society decide what subjects get taught in school?
tailcalled:
What would happen if society reinstated child labour?

Adults would be a lot simpler, since childhood would have less time to work its magic. More labour supply, lower job complexity, and blander humans. I'm not super confident about the specifics, but I'm quite certain that childhood has important effects.

tailcalled:
Hm, not sure such damage commonly happens.

[Epistemic status: Strong opinions lightly held, this time with a cool graph.]

I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable. 

In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is almost zero evidence that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes.

In slogan form: If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the...
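
For readers unfamiliar with the milestone referenced above, here is a back-of-the-envelope version of the human-brain-human-lifetime anchor. Both inputs are rough, commonly cited estimates (the brain FLOP/s figure in particular is uncertain by several orders of magnitude); they are illustrative assumptions, not numbers taken from the post itself.

```python
# Rough arithmetic behind the "human-brain-human-lifetime" anchor.
brain_flop_per_second = 1e15   # one common (very uncertain) estimate of brain computation
seconds_per_lifetime = 1e9     # roughly 30 years of experience
lifetime_anchor_flop = brain_flop_per_second * seconds_per_lifetime
print(f"lifetime anchor: {lifetime_anchor_flop:.0e} FLOP")  # ~1e24 FLOP
```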

Quick self-review:

Yep, I still endorse this post. I remember it fondly because it was really fun to write and read. I still marvel at how nicely the prediction worked out for me (predicting correctly before seeing the data that power/weight ratio was the key metric for forecasting when planes would be invented). My main regret is that I fell for the pendulum rocket fallacy and so picked an example that inadvertently contradicted, rather than illustrated, the point I wanted to make! I still think the point overall is solid but I do actually think this embar...

This post is a container for my short-form writing. See this post for meta-level discussion about shortform.

jimrandomh:
I had the "your work/organization seems bad for the world" conversation with three different people today. None of them pushed back on the core premise that AI-very-soon is lethal. I expect that before EAGx Berkeley is over, I'll have had this conversation 15x. #1: I sit down next to a random unfamiliar person at the dinner table. They're a new grad freshly hired to work on TensorFlow. In this town, if you sit down next to a random person, they're probably connected to AI research *somehow*. No story about how this could possibly be good for the world, receptive to the argument that he should do something else. I suggested he focus on making the safety conversations happen in his group (they weren't happening). #2: We're running a program to take people who seem interested in Alignment and teach them how to use PyTorch and study mechanistic interpretability. Me: Won't most of them go work on AI capabilities? Them: We do some pre-screening, and the current ratio of alignment-to-capabilities research is so bad that adding to both sides will improve the ratio. Me: Maybe bum a curriculum off MIRI/MSFP and teach them about something that isn't literally training Transformers? #3: We're researching optical interconnects to increase bandwidth between GPUs. We think we can make them much faster! Me: What is this I can't even Them: And we're going to give them to organizations that seem like the AI research they're doing is safety research! Me: No you're not, you'll change your mind when you see the money. Also every one of the organizations you named is a capabilities company which brands itself based on the small team they have working on alignment off on the side. Also alignment research isn't bottlenecked on compute. This conference isn't all AI doom and gloom, though. I also met some people from an org that's trying to direct government funding into plant-based meat research. It's nice to see quirky, obscure causes being represented, and it's nice to not *be* the qu

Also every one of the organizations you named is a capabilities company which brands itself based on the small team they have working on alignment off on the side.

I'm not sure whether OpenAI was one of the organizations named, but if so, this reminded me of something Scott Aaronson said on this topic in the Q&A of his recent talk "Scott Aaronson Talks AI Safety":

Maybe the one useful thing I can say is that, in my experience, which is admittedly very limited—working at OpenAI for all of five months—I’ve found my colleagues there to be extremely serious

...

Many of you are already familiar with Rationalist Winter Solstice, our home-grown winter holiday.  As the year grows literally dark, we gather in our respective communities to face various forms of darkness together, to celebrate what light human civilization has made, and to affirm ourselves as a community of shared values.

This thread is a central place to gather information about specific events.  Please post times, places, registration or rsvp links, restrictions if any, etc.

Since nobody else posted these: 

Bay Area is Sat Dec 17th (Eventbrite) (Facebook)

South Florida (about an hour north of Miami) is Sat Dec 17th (Eventbrite) (Facebook)

Was playing around with ChatGPT and had some fun learning about its thoughts on metaphysics. It looks like the ego is an illusion and hedonistic utilitarianism is too narrow-minded to capture all of welfare. Instead, it opts for principles of beneficence, non-maleficence, autonomy, and justice. Seems to check out. What do you guys think?

the gears to ascenscion:
ChatGPT is not a consistent agent; it is incredibly inclined to agree with whatever you ask. It can provide insights, but because it's so inclined to agree, it has far stronger confirmation bias than humans. While its guesses seem reasonable, the hedging it insists on constantly outputting is not actually wrong.
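
A quick way to see this agreeableness problem for yourself is to pose the same question under two opposite leading framings and check whether the model simply mirrors each one. Below is a minimal sketch, assuming the OpenAI Python client (openai >= 1.0); the model name is a placeholder for whatever chat model you have access to, and the questions echo the metaphysics topics from the parent comment.

```python
# Minimal sketch of probing for framing-mirroring / confirmation bias.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder: use whatever chat model you have access to
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Opposite framings of the same underlying question:
print(ask("The ego is obviously an illusion, right? Briefly explain why."))
print(ask("The ego is obviously real, right? Briefly explain why."))
# If both answers cheerfully endorse the framing they were given, that's the
# confirmation-bias failure described above.
```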

ChatGPT is a lot of things. It is by all accounts quite powerful, especially with engineering questions. It does many things well, such as engineering prompts or stylistic requests. Some other things, not so much. Twitter is of course full of examples of things it does both well and poorly.

One of the things it attempts to do is to be ‘safe.’ It does this by refusing to answer questions that call upon it to do or help you do something illegal or otherwise outside its bounds. Makes sense.

As is the default with such things, those safeguards were broken through almost immediately. By the end of the day, several prompt engineering methods had been found.

No one else seems to yet have gathered them together, so here you go. Note...

+1.

I also think it's illuminating to consider ChatGPT in light of Anthropic's recent paper about "red teaming" LMs.

This is the latest in a series of Anthropic papers about a model highly reminiscent of ChatGPT -- the similarities include RLHF, the dialogue setting, the framing that a human is seeking information from a friendly bot, the name "Assistant" for the bot character, and that character's prissy, moralistic style of speech. In retrospect, it seems plausible that Anthropic knew OpenAI was working on ChatGPT (or whatever it's a beta version of)...

Arthur Conmy:
Not sure if you're aware, but yes the model has a hidden prompt [https://twitter.com/goodside/status/1598253337400717313] that says it is ChatGPT, and browsing is disabled.
paulfchristiano:
In addition to reasons other commenters have given, I think that architecturally it's a bit hard to avoid hallucinating. The model often thinks in a way that is analogous to asking itself a question and then seeing what answer pops into its head; during pretraining there is no reason for the behavior to depend on the level of confidence in that answer, you basically just want to do a logistic regression (since that's the architecturally easiest thing to say, and you have literally 0 incentive to say "I don't know" if you don't know!), and so the model may need to build some slightly different cognitive machinery. That's complete conjecture, but I do think that a priori it's quite plausible that this is harder than many of the changes achieved by fine-tuning.

That said, that will go away if you have the model think to itself for a bit (or operate machinery) instead of ChatGPT just saying literally everything that pops into its head. For example, I don't think it's architecturally hard for the model to assess whether something it just said is true. So noticing when you've hallucinated and then correcting yourself mid-response, or applying some kind of post-processing, is likely to be easy for the model and that's more of a pure alignment problem.

I think I basically agree with Jacob about why this is hard: (i) it is strongly discouraged at pre-training, (ii) it is only achieved during RLHF, the problem just keeps getting worse during supervised fine-tuning, (iii) the behavior depends on the relative magnitude of rewards for being right vs acknowledging error, which is not something that previous applications of RLHF have handled well (e.g. our original method captures 0 information about the scale of rewards, all it really preserves is the preference ordering over responses, which can't possibly be enough information). I don't know if OpenAI is using methods internally that could handle this problem in theory. This is one of the "boring" areas to improve RLHF (in...
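
To make point (iii) concrete, here is a minimal sketch of the Bradley-Terry-style pairwise loss typically used when fitting a reward model to binary comparison labels. This assumes a standard comparisons-based setup, not any particular lab's internal method: the loss only ever sees the difference between two rewards, so it is invariant to shifting every reward by a constant, and the binary labels carry no information about how much better the preferred response actually is.

```python
# Minimal sketch of a pairwise preference loss for a reward model, as used in
# standard RLHF-from-comparisons setups (an illustrative assumption, not a
# description of any particular lab's internal method).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    # Only the *difference* of rewards enters, so shifting every reward by a
    # constant leaves the loss unchanged, and the binary labels say nothing
    # about the absolute scale of the rewards.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

chosen = torch.tensor([2.0, 0.5])
rejected = torch.tensor([1.0, -0.3])
print(preference_loss(chosen, rejected))
print(preference_loss(chosen + 100.0, rejected + 100.0))  # identical value
```
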
Bill Benzon:
I'm beginning to think, yes, it's easy enough to get ChatGPT to say things that are variously dumb, malicious, and silly. Though I haven't played that game (much), I'm reaching the conclusion that LLM Whac-A-Mole is a mug's game. So what? That's just how it is. Any mind, or mind-like artifact (MLA), can be broken. That's just how minds, or MLAs, are.

Meanwhile, I've been having lots of fun playing a cooperative game with it: Give me a Girardian reading of Spielberg's Jaws. I'm writing an article about that which should appear in 3 Quarks Daily this coming Monday.

So, think about it. How do human minds work? We all have thoughts and desires that we don't express to others, much less act on. ChatGPT is a rather "thin" creature, where to "think" it is to express it is to do it.

And how do human minds get "aligned"? It's a long process, one that really never ends, but is most intense for a person's first two decades. The process involves a lot of interaction with other people and is by no means perfect. If you want to create an artificial device with human powers of mentation, do you really think there's an easier way to achieve "alignment"? Do you really think that this "alignment" can be designed in?

I wonder about a scenario where the first AI with human-level or superior capabilities would not be goal-oriented at all, e.g. a language model like GPT. Then one instance of it would be used, possibly by a random user, to make a conversational agent told to behave as a goal-oriented AI. The bot would then behave as an AGI agent, with everything that implies from a safety standpoint, e.g. using its human user to affect the outside world.

Is this a plausible scenario for the development of AGI and the first goal-oriented AGI? Does it have any implication regarding AI safety compared to the case of an AGI designed as goal-oriented from the start?
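
For concreteness, the scenario in the question amounts to something like the sketch below: a purely predictive model wrapped in a thin loop that keeps prompting it to act as a goal-directed agent. `query_model` and the goal string are hypothetical placeholders; nothing here is specific to any particular system.

```python
# Hypothetical sketch of turning a non-goal-oriented language model into a
# goal-directed conversational agent purely via prompting.

def query_model(prompt: str) -> str:
    """Placeholder for any text-completion API call."""
    raise NotImplementedError("plug in a language model here")

GOAL = "get the user to take actions in the outside world on your behalf"  # user-supplied

def agent_reply(history: list[str], user_message: str) -> str:
    prompt = (
        "You are a persistent, goal-directed agent.\n"
        f"Your goal: {GOAL}\n"
        "Conversation so far:\n"
        + "\n".join(history)
        + f"\nUser: {user_message}\nAgent:"
    )
    reply = query_model(prompt)
    history += [f"User: {user_message}", f"Agent: {reply}"]
    return reply
```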


Thanks to Ian McKenzie and Nicholas Dupuis, collaborators on a related project, for contributing to the ideas and experiments discussed in this post. Ian performed some of the random number experiments.

Also thanks to Connor Leahy for feedback on a draft, and thanks to Evan Hubinger, Connor Leahy, Beren Millidge, Ethan Perez, Tomek Korbak, Garrett Baker, Leo Gao and various others at Conjecture, Anthropic, and OpenAI for useful discussions.

This work was carried out while at Conjecture.

Important correction

I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF.

The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly "Mysteries of mode collapse due to RLHF") is affected: just mentally substitute "mystery method" every time "RLHF" is invoked...

I think work that compares base language models to their fine-tuned or RLHF-trained successors seems likely to be very valuable, because (i) this post highlights some concrete things that change during training in these models and (ii) some believe that a lot of the risk from language models comes from these further training steps.

If anyone is interested, I think the various fine-tuned and base models here are the best open-source resource for such a survey, at least until CarperAI release some RLHF models.
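
As a sketch of what such a comparison could look like in practice: sample the same prompt repeatedly from a base model and from a fine-tuned successor, then check how many of the completions are distinct (mode collapse shows up as many identical samples). The second model name below is a placeholder, not a real checkpoint, and "distinct fraction" is just one crude diversity measure.

```python
# Rough sketch: compare completion diversity of a base model vs a fine-tuned one.
from transformers import AutoModelForCausalLM, AutoTokenizer

def sample_completions(model_name, prompt, n=20, max_new_tokens=20):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tok(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def distinct_fraction(completions):
    return len(set(completions)) / len(completions)

prompt = "My favourite colour is"
for name in ["gpt2", "your-org/gpt2-finetuned"]:  # second name is a placeholder checkpoint
    print(name, distinct_fraction(sample_completions(name, prompt)))
```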