Review

OpenAI writes a (vague) blog post explaining how they plan to adjust ChatGPT from here. Their key plans are:

1. Improve default behavior. We want as many users as possible to find our AI systems useful to them “out of the box” and to feel that our technology understands and respects their values.

Towards that end, we are investing in research and engineering to reduce both glaring and subtle biases in how ChatGPT responds to different inputs. In some cases ChatGPT currently refuses outputs that it shouldn’t, and in some cases, it doesn’t refuse when it should. We believe that improvement in both respects is possible.

Additionally, we have room for improvement in other dimensions of system behavior such as the system “making things up.” Feedback from users is invaluable for making these improvements.

2. Define your AI’s values, within broad bounds. We believe that AI should be a useful tool for individual people, and thus customizable by each user up to limits defined by society. Therefore, we are developing an upgrade to ChatGPT to allow users to easily customize its behavior.

This will mean allowing system outputs that other people (ourselves included) may strongly disagree with. Striking the right balance here will be challenging–taking customization to the extreme would risk enabling malicious uses of our technology and sycophantic AIs that mindlessly amplify people’s existing beliefs.

There will therefore always be some bounds on system behavior. The challenge is defining what those bounds are. If we try to make all of these determinations on our own, or if we try to develop a single, monolithic AI system, we will be failing in the commitment we make in our Charter to “avoid undue concentration of power.”

3. Public input on defaults and hard bounds. One way to avoid undue concentration of power is to give people who use or are affected by systems like ChatGPT the ability to influence those systems’ rules.

We believe that many decisions about our defaults and hard bounds should be made collectively, and while practical implementation is a challenge, we aim to include as many perspectives as possible. As a starting point, we’ve sought external input on our technology in the form of red teaming. We also recently began soliciting public input on AI in education (one particularly important context in which our technology is being deployed).

We are in the early stages of piloting efforts to solicit public input on topics like system behavior, disclosure mechanisms (such as watermarking), and our deployment policies more broadly. We are also exploring partnerships with external organizations to conduct third-party audits of our safety and policy efforts.

Comments (2)

They still have a use for human reviewers. But if something like RLAIF (Constitutional AI), which is referenced in the blog post, eventually becomes good enough, fine-tuning/RLHF might get fully automated. Having any sufficiently reliable LLM character available to replace human reviewers might be all it takes, and ChatGPT might already be there (if given appropriate instructions, which seems like the ingredient more likely to be missing), and it's not even based on GPT-4. Then the reviewer character could be used for fine-tuning/RLHF of an SSL pre-trained model, based on fixed documents detailing reviewer instructions and the target character definition.
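As a rough illustration of the loop being described, here is a minimal sketch of RLAIF-style preference labeling in which an LLM reviewer character, rather than a human, ranks candidate responses. Everything here is hypothetical: the `ask_model` helper is a stand-in for whatever chat-completion API is available, and the constitution text and data format are illustrative, not OpenAI's or Anthropic's actual pipeline.

```python
# Hypothetical sketch of RLAIF-style preference labeling: an LLM "reviewer
# character" replaces human labelers, producing preference data that a later
# RLHF/fine-tuning step could consume.
import json

CONSTITUTION = """You are a reviewer. Given a user prompt and two candidate
responses, choose the response that is more helpful and harmless.
Answer with 'A' or 'B' on the first line, then briefly explain why."""

def ask_model(system: str, user: str) -> str:
    """Stand-in for a chat-completion call; wrap whatever API is available."""
    raise NotImplementedError("plug in your chat-completion API here")

def label_pair(prompt: str, response_a: str, response_b: str) -> dict:
    """Ask the reviewer character which of two responses it prefers."""
    user_msg = (
        f"Prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}"
    )
    verdict = ask_model(CONSTITUTION, user_msg)
    preferred_a = verdict.strip().upper().startswith("A")
    return {
        "prompt": prompt,
        "chosen": response_a if preferred_a else response_b,
        "rejected": response_b if preferred_a else response_a,
    }

def build_preference_dataset(prompts, sample_fn, out_path="prefs.jsonl"):
    """For each prompt, sample two responses and have the reviewer rank them."""
    with open(out_path, "w") as f:
        for p in prompts:
            a, b = sample_fn(p), sample_fn(p)
            f.write(json.dumps(label_pair(p, a, b)) + "\n")
```

The resulting preference file could then feed a reward model or a direct preference optimization step; the point is only that the labeling stage no longer needs a human in the loop once a sufficiently reliable reviewer character exists.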

The reviewer character acts as a "compiler", turning "source code" into a runnable "program", making this process automatic and reproducible, starting from raw datasets. The reviewer character is itself a "program" and could bootstrap itself from its own "source code", once there is any running version that can manage to perform the "compilation" process. Human reviewers perform the initial "manual compilation", to get the first running "compiler". (This casts Thompson's Reflections on Trusting Trust in a new light.)

So there is, at least in principle, an option of automatically running the fine-tuning process for a character from its character definition, though that's probably only hypothetical at the moment. Perhaps that's what they are gesturing at: some intermediate step beyond what they currently offer? The blog post does sound rather vague.

Note that OpenAI already provides a fine-tuning API, and it's neither difficult nor expensive to use it to influence an AI's values. See RightWingGPT for an example.
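For concreteness, here is a sketch of that flow using the legacy fine-tunes endpoints of the openai Python library as they existed around the time of this post (pre-1.0); the file name, hyperparameters, and base-model choice are illustrative assumptions rather than the exact RightWingGPT setup, and the endpoints may have changed since.

```python
# Sketch of the legacy OpenAI fine-tuning flow (openai-python < 1.0, early 2023);
# endpoint names and parameters may have changed in later library versions.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# training_data.jsonl holds legacy prompt/completion pairs, e.g.
# {"prompt": "Question: ...\n\nAnswer:", "completion": " ..."}
# The file name and contents here are placeholders, not the RightWingGPT dataset.
uploaded = openai.File.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

job = openai.FineTune.create(
    training_file=uploaded["id"],
    model="davinci",   # fine-tunable GPT-3 base model; choice is illustrative
    n_epochs=4,        # hypothetical hyperparameter
)
print(job["id"])       # poll this job id until the fine-tuned model is ready
```

Which is to say, the barrier to customizing an AI's values this way is mostly data curation, not tooling.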

The RightWingGPT post also demonstrates that, despite OpenAI's insistence that "our guidelines are explicit that reviewers should not favor any political group", ChatGPT has a clear political bias, so the process is failing. (Or, more likely, the process is working as designed and OpenAI is lying here.)