Recent Discussion

ChatGPT is a lot of things. It is by all accounts quite powerful, especially with engineering questions. It does many things well, such as engineering prompts or stylistic requests. Some other things, not so much. Twitter is of course full of examples of things it does both well and poorly.

One of the things it attempts to do is be ‘safe.’ It does this by refusing to answer questions that call upon it to do or help you do something illegal or otherwise outside its bounds. Makes sense.

As is the default with such things, those safeguards were broken through almost immediately. By the end of the day, several prompt engineering methods had been found.

No one else seems to have gathered them together yet, so here you go. Note...

My understanding of why it's especially hard to stop the model making stuff up (while not saying "I don't know" too often), compared to other alignment failures:

  • The model inherits a strong tendency to make stuff up from the pre-training objective.
  • This tendency is reinforced by the supervised fine-tuning phase, if there are examples of answers containing information that the model doesn't know. (However, this can be avoided to some extent, by having the supervised fine-tuning data depend on what the model seems to know, a technique that was employed here.)
  • I...
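
A rough sketch of what "having the supervised fine-tuning data depend on what the model seems to know" could look like in practice; this is an illustration only, not the actual pipeline, and `sample_model` and the abstention string are placeholders:

```python
# Illustrative only: pick the fine-tuning target based on whether the base model
# already appears to know the answer, so it isn't trained to confidently assert
# facts it cannot produce on its own.
def build_sft_target(question, reference_answer, sample_model, n_samples=5):
    """sample_model(question) -> str stands in for sampling the base model."""
    samples = [sample_model(question) for _ in range(n_samples)]
    knows_it = any(reference_answer.lower() in s.lower() for s in samples)
    return reference_answer if knows_it else "I don't know."
```
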
3 · tslarm · 2h
Yeah, in cases where the human is very clearly trying to 'trick' the AI into saying something problematic, I don't see why people would be particularly upset with the AI or its creators. (It'd be a bit like writing some hate speech into Word, taking a screenshot and then using that to gin up outrage at Microsoft.) If the instructions for doing dangerous or illegal things were any better than could be easily found with a google search, that would be another matter; but at first glance they all seem the same or worse.

Edit: Likewise, if it was writing superhumanly persuasive political rhetoric then that would be a serious issue. But that too seems like something to worry about with respect to future iterations, not this one. So I wouldn't assume that OpenAI's decision to release ChatGPT implies they believed they had it securely locked down.
1 · Adam Jermyn · 2h
Roughly, I think it’s hard to construct a reward signal that makes models answer questions when they know the answers and say they don’t know when they don’t know. Doing that requires that you are always able to tell what the correct answer is during training, and that’s expensive to do. (Though e.g. Anthropic seems to have made some progress here: https://arxiv.org/abs/2207.05221.)
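
To make the difficulty concrete, here is a toy sketch (mine, not from the paper): any reward scheme that pays for correct answers, penalizes confident errors, and gives partial credit for abstaining only works if a verified answer exists for each training question, and producing that verification at scale is the expensive part.

```python
from typing import Optional

def grade_answer(model_answer: str, verified_answer: Optional[str]) -> float:
    """Toy reward: +1 correct, -1 confidently wrong, small credit for abstaining."""
    if verified_answer is None:
        # Without a ground-truth label the answer can't be scored at all;
        # obtaining such labels for every training question is the costly step.
        raise ValueError("no verified answer available for this question")
    if model_answer.strip().lower().rstrip(".") == "i don't know":
        return 0.2   # abstaining beats guessing wrong, loses to answering right
    return 1.0 if model_answer.strip() == verified_answer.strip() else -1.0
```
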
2 · habryka · 4h
No, the "browsing: enabled" is I think just another hilarious way to circumvent the internal controls.

The goal of this post is mainly to increase the exposure of the AI alignment community to Active Inference theory, which seems to be highly relevant to the problem but is seldom mentioned on the forum.

This post links to a freely available book about Active Inference, published this year. For alignment researchers, the most relevant chapters will be 1, 3, and 10.


Active Inference as a formalisation of instrumental convergence

Active Inference is a theory describing the behaviour of agents that want to counteract surprising, “entropic” hits from the environment via accurate prediction and/or placing themselves in a predictable (and preferred) environment.
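
For readers who want the formal core: in the standard formulation (generic notation, which may differ in details from the book's), the agent maintains approximate beliefs Q(x) over hidden states x and minimises a variational free energy with respect to observations y, which upper-bounds the "surprise" that the post describes the agent as counteracting:

```latex
% Variational free energy of beliefs Q(x), given observations y and generative model P(x, y)
F[Q] = \mathbb{E}_{Q(x)}\!\big[\ln Q(x) - \ln P(x, y)\big]
     = D_{\mathrm{KL}}\!\big[Q(x)\,\|\,P(x \mid y)\big] - \ln P(y)
     \;\ge\; -\ln P(y).
```

Minimising F therefore both improves the accuracy of the agent's beliefs and keeps observed outcomes unsurprising, which is the sense in which accurate prediction and seeking a predictable, preferred environment fall out of one objective.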

Active Inference agents update their beliefs in response to observations (y), update the parameters and shapes of their models Q and P (which can be seen as a...

I don't know what the "underlying real probability" is (no condescension in this remark; I'm genuinely confused about the physics and philosophy of probability and haven't got time to figure it out for myself, and I'm not sure this is a settled question).

Both P and Q are something that is implemented (i.e., encoded in some way) by the agent itself. The agent knows nothing about the "true generative model" of the environment (even if we can discuss it; see below). The only place where "the feedback from the environment" enters this process is in the calcu...

I recently watched Squid Game—both the original Netflix series and MrBeast's real-life version. Then I watched every video I could find about how MrBeast's Squid Game was made. I was surprised to learn that MrBeast used a whole bunch of CGI (computer-generated imagery). I had taken its reality at face value.

There are four places I know of where MrBeast used CGI.

  • The sky above the big dirt field wasn't originally sky. The scenes were actually filmed in an indoor stadium.
  • The smoke effects of players who got eliminated were added in post-production.
  • A dust effect where a woman fell was CGI.
  • The big platforms for Tug-of-War and Glass Bridge weren't real at all. Both events took place close to the ground. In those scenes, almost everything the players didn't physically
...

The big platforms for Tug-of-War and Glass Bridge weren't real at all.

I should have realized then. I noticed my confusion ("If that drop was 20 meters and those walls are only ~3 feet high, I would be terrified and not casually walking by the edge like those guys"). But I failed to think of any hypotheses that fit the data.

Added: Ah, your explanation for why you fell for it makes perfect sense. I was so used to knowing it was real that I didn't notice the one time it wasn't.

(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)

 

Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization.

Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.

We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there...

4 · Rob Bensinger · 1h
The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act [https://arbital.com/p/pivotal/] that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period [https://www.danieldewey.net/fast-takeoff-strategies.html]. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities [https://arbital.com/p/minimality_principle/], out of the set of acts that suffice to prevent the world from being destroyed:

Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI. MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment".

So I can separately answer the question "what's MIRI's organizational plan for solving alignment?" My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing [https://intelligence.org/late-2021-miri-conversations/] of their models [https://www.lesswrong.com/s/v55BhXbpJuaExkpcD], while keeping an eye out for hopeful-seeming ideas.

  • If an alignment idea strikes us as having even a tiny scrap of hope, and isn't already funding-saturated, then we're making sure it gets funded. We don't care whether that happens at MIRI versus elsewhere — we're just seeking to maximize the amount of good work that's happening in the world (insofar as money can help with that), and trying to bring about the existence of a research ecosystem that contains a wide variety [https://www.lesswrong.com/posts/3pinFH3jerMzAvmza/on-how-various-plans-miss-the-hard-bits-of-the-alignment#High_level_view] of different moonshots and speculative ideas that are targeted at the core...

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.

FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for a pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have...

1 · Igor Ivanov · 9h
I agree. In my opinion, this methodology would be a great way for a model to learn how to persuade humans and exploit their biases, because this way the model might learn these biases not just from the data it collected but also fine-tune its understanding by testing its own hypotheses.
12 · Daniel Kokotajlo · 14h
I'm happy to see OpenAI and OpenAI Alignment Team get recognition/credit for having a plan and making it public. Well deserved I'd say. (ETA: To be clear, like the OP I don't currently expect the plan to work as stated; I expect us to need to pivot eventually & hope a better plan comes along before then!)

I wanted to ask ChatGPT how to optimize a few normal routines. One of my questions was about how to brush teeth. My conversation with ChatGPT:

What's the best way to brush teeth?

The best way to brush teeth is to use a soft-bristled toothbrush and fluoride toothpaste. To brush your teeth effectively, follow these steps:

  1. Wet your toothbrush and apply a small amount of toothpaste to the bristles.
  2. Place the toothbrush at a 45-degree angle against the gums and move the brush back and forth in a gentle circular motion to clean the teeth and gums.
  3. Brush the outer surfaces of all of your teeth, using the same back-and-forth motion.
  4. Use the tip of the toothbrush to clean the inside surfaces of the front teeth, using a gentle up-and-down motion.
  5. To clean the
...
1 · Lao Mein · 4h
Is there a benefit to using a tongue scraper over just using your teeth? I instinctively do it when I feel buildup on my tongue.

Most of the buildup is in the very back.

2 · ChristianKl · 7h
ChatGPT is not yet good enough to solve every problem that you throw at it on its own, but it can help you with brainstorming about what might be happening with your problem. ChatGPT can also correctly answer questions like "Write a Wikidata SPARQL query that shows all women who are poets and who live in Germany". It's again an easy-to-verify answer, but it's an answer that allows you to research further. The ability to iterate quickly is useful in combination with other research steps.
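
For concreteness, here is the kind of query meant, wrapped in Python so it can be run against the public Wikidata endpoint. The query is a reconstruction rather than ChatGPT's actual output, and "live in Germany" is read here as country of citizenship (P27); residence (P551) would be the stricter reading.

```python
import requests

# Women (P21 = Q6581072) with occupation poet (P106 = Q49757) and country of
# citizenship Germany (Q183); P31 = Q5 restricts results to humans.
QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P106 wd:Q49757 ;
          wdt:P27 wd:Q183 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "sparql-example/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```
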
1 · GregK · 2h
This is probably key. If GPT can solve something much faster, that's indeed a win. (With the SPARQL example I guess it would take me 10-20 minutes to look up the required syntax and fields and put them together. GPT cuts that down to a few seconds, which seems quite good.)

My issue is that I haven't found a situation yet where GPT is reliably helpful for me. Maybe someone who has found such situations, and reliably integrated "ask GPT first" as a step into some of their workflows, could give their account? I would genuinely be curious about practical ways people have found to use these models.

My experience has been quite bad so far, unfortunately. For example, I tried to throw a problem at it that I was pretty sure didn't have an easy solution [https://www.lesswrong.com/posts/goC9qv4PWf2cjfnbm/did-chatgpt-just-gaslight-me?commentId=66FeeGkX3e37zhJhp], but I just wanted to check that I didn't miss anything obvious. The answer I would expect in this case is "I don't know of any easy solution", but instead I got pages of hallucinated BS. This is worse than if I just hadn't asked GPT at all, since now I have to waste my time reading through its long answers just to realize it's complete BS.

Bets and bonds are tools for handling different epistemic states and levels of trust. Which makes them a great fit for negotiating with small children!

A few weeks ago Anna (4y) wanted to play with some packing material. It looked very messy to me, I didn't expect she would clean it up, and I didn't want to fight with her about cleaning it up. I considered saying no, but after thinking about how things like this are handled in the real world I had an idea. If you want to do a hazardous activity, and we think you might go bankrupt and not clean up, we make you post a bond. This money is held in escrow to fund the cleanup if you disappear. I explained how this worked, and she went...

As I said in my original comment here, I'm not a parent, so I didn't get a chance to try this. But now I work at a kindergarten, and was reminded of this post by the review process, so I can actually try it! Expect another review after I do :)


The text that follows is a copy-paste from the survey creator's Facebook post, shared with permission.

Asking all Cryonicists and those considering it or curious about it to... ~ Please take the Great Cryonics Survey of 2022 ~

Go to --> bit.ly/cryosurvey2022

Take the survey if:

  1. You're signed up for Cryonics.
  2. You think Cryonics is interesting but you aren't ready to sign up yet.
  3. You've decided to sign up but haven't gotten around to it yet.

The goal of this survey is to understand what makes Cryonicists tick using data rather than anecdotes. Here are some examples of the kinds of issues this survey is meant to understand:

  1. How many Cryonicists are in favor of "mind uploading" and how many are against it?
  2. How do Cryonicists think about death?
  3. How many Cryonicists suffer from some form of death anxiety?
  4. How
...

Disclaimer: At the time of writing, this has not been endorsed by Evan.

I can give this a go.

Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which dec...

Definition. On how I use words, values are decision-influences (also known as shards). “I value doing well at school” is a short sentence for “in a range of contexts, there exists an influence on my decision-making which upweights actions and plans that lead to e.g. learning and good grades and honor among my classmates.” 
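
A toy sketch of that definition (my own illustration, not code from the post): a value as a contextually activated influence that adds bids to plans, rather than a global objective that everything is optimized against.

```python
def school_shard(context: set, plan: str) -> float:
    """An additive bid on `plan`, active only in the contexts that cue it."""
    if "at school" not in context and "thinking about school" not in context:
        return 0.0                    # dormant outside its activating contexts
    return 1.0 if "study" in plan else 0.0

def choose_plan(context, plans, shards):
    # Each active shard adds its bid; the plan with the highest total is chosen.
    return max(plans, key=lambda plan: sum(shard(context, plan) for shard in shards))

print(choose_plan({"at school"}, ["study for exam", "eat candy"], [school_shard]))
```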

Summaries of key points:

  1. Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don't have to be "globally robust" or "perfect."
  2. Values steer optimization; they are not optimized against. The value shards aren't getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API). 

    Since values are not the
...
2 · cfoster0 · 11h
I don't think that is what is happening. I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts [https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8/p/zXibERtEWpKuG5XAC] to strengthen & broaden the connections responsible for positive value errors. TD learning (especially with temporal abstraction [https://ai.stackexchange.com/questions/13254/what-are-options-in-reinforcement-learning]) lets the agent immediately update its behavior based on predictive/associational representations, rather than needing the much slower reward circuits to activate. You know the feeling of "Oh, that idea is actually pretty good!"? In my book, that ≈ positive TD error.

A diamond shard is downstream of representations like "shiny appearance" and "clear color" and "engagement" and "expensive" and "diamond" and "episodic memory #38745", all converging to form the cognitive context that informs when the shard triggers. When the agent imagines a possible plan like "What if I robbed a jewelry store?", many of those same representations will be active, because "jewelry" spreads activation into adjacent concepts in the agent's mind like "diamond" and "expensive". Since those same representations are active, the diamond shard downstream from them is also triggered (though more weakly than if the agent were actually seeing a diamond in front of them) and bids for that chain of thought to continue. If that robbery-plan-thought seems better than expected (i.e. creates a positive TD error) upon consideration, all of the contributing concepts (including concepts like "doing step-by-step reasoning") are immediately upweighted so as to reinforce & generalize their firing pattern into the future.

In my picture there is no separate "planner" component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior. Planning is the name for running shard dynamics forward while loop...
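
A minimal TD(0) sketch of the "update on thoughts, not just rewards" point (my own illustration; the contexts and numbers are made up): an imagined plan whose estimated value is high already produces a positive prediction error, which immediately strengthens whatever was active, without waiting for reward circuits.

```python
from collections import defaultdict

V = defaultdict(float)        # learned value estimate per cognitive context
alpha, gamma = 0.1, 0.95      # learning rate, discount factor

def td_update(context, next_context, reward=0.0):
    """One TD(0) step: nudge V(context) toward reward + gamma * V(next_context)."""
    td_error = reward + gamma * V[next_context] - V[context]
    V[context] += alpha * td_error    # positive error => reinforce whatever
    return td_error                   # representations produced `context`

# "Oh, that idea is actually pretty good!" ~ a positive TD error with zero reward:
V["imagining: rob the jewelry store"] = 2.0   # the imagined plan looks promising
print(td_update("weighing possible plans", "imagining: rob the jewelry store"))
```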

I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors

Interesting, I'll look into TD learning in more depth later. Anecdotally, though, this doesn't seem to be quite right. I model shards as consciously-felt urges, and it sure seems to me that I can work towards anticipating and satisfying-in-advance these urges without actually feeling them.

To quote the post you linked:

For example, imagin

...