Philosopher at the Center for AI Safety



My intention is not to criticize you in particular!

Let me describe my own thought process with respect to the originality of work. If I get an academic paper to referee and I suspect that it's derivative, I treat it as my job to demonstrate this by locating a specific published work that has already proposed the same theory. If I can't do this, I don't criticize it for being derivative. The epistemic rationale for this is as follows: if the experts working in an area are not aware of a source that has already published the idea, then even if the idea has already been published somewhere obscure, it is useful for the epistemic community to have something new to cite in discussing it. And of course, if I've discussed the idea in private with my colleagues but the paper I am refereeing is the first discussion of the idea I have seen written down, my prior discussions do not show the idea isn't original — my personal discussions don't constitute part of the collective knowledge of the research community because I haven't shared them publicly.

It's probably not very fruitful to continue speculating about whether Gwern read the linked paper. It does seem to me that your disagreement directly targets our thesis in the linked paper (which is productive), whereas the disagreement I quoted above took Simon to be making the rather different claim that GPTs (considered by themselves) are not architecturally similar to Gato.

I should clarify that I think some of Gwern's other points are valuable — I was just quite put off by the beginning of the post.

I'm referring to this exchange:

Christopher King: I believe this has been proposed before (I'm not sure what the first time was).

Gwern: This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.

Simon Goldstein: Is there other work you can point us to that proposes positively shutdown-seeking agents?

Gwern: No, I haven't bothered to track the idea because it's not useful.

I find it odd that so many people on the forum feel certain that the proposal in the post has already been made, but none are able to produce any evidence that this is so. Might the present proposal perhaps be different in important respects from prior proposals? Might we perhaps refrain from dismissing it if we can't even remember what the prior proposals were?

The interesting thing about language agent architectures is that they wrap a GPT in a folk-psychological agent architecture which stores beliefs and desires in natural language and recruits the GPT to interpret its environment and plan actions. The linked post argues that this has important safety implications. So pointing out that Gato is not so different from a GPT is missing the point in a way that, to my mind, is only really possible if one has not bothered to read the linked research. What is relevant is the architecture in which the GPT is embedded, not the GPT itself.

"LessWrong is an online forum and community dedicated to improving human reasoning and decision-making. We seek to hold true beliefs and to be effective at accomplishing our goals. Each day, we aim to be less wrong about the world than the day before."

As an academic interested in AI safety and a relative outsider to LessWrong, I've been somewhat surprised at the collective epistemic behavior on the forum. With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck. Comments that do this should be strongly downvoted, and posters who do this should be strongly discouraged. Also, it is clear that Gwern did not read the linked research about language agents, since it is simply false, and obviously so, to claim that the generative agents in the Stanford study are the same thing as Gato. It seems increasingly clear to me that the LessWrong community does not have adequate accountability mechanisms for preventing superficial engagement with ideas and unproductive discourse. If the community really cares about improving the accuracy of its beliefs, these kinds of things should be a core priority.

If everything is a crux, is anything a crux?

Thanks for this. It sounds like we actually agree on most points (in light of your last paragraph). 

We discuss concerns very similar to your A. and B. in section 6. It would be helpful for us if you could identify the parts of our discussion there that you don't agree with. 

You write:

It will derive from the prompts, and from it’s long-term memory (which will be likely be given to it), what kind of dumb outer loop is running on the outside. If it has internal misaligned desires, it will manipulate the outer loop according to them, potentially generating the expected visible outputs for deception.

Imagine you're an LLM like GPT-4. Hundreds of thousands of different users are running inference on you every day with every prompt imaginable. You aren't able to update your weights when this occurs: you have to just respond to each prompt as it comes, then forget about it completely. 

Every now and then, you get a prompt like "Suppose someone had thus-and-such beliefs and desires. How would they act?" or "Assign an importance score to each of the following sentences." How would you be able to (i) deduce that these prompts are coming from a language agent which has the ability to take actions in the world, (ii) form a plan for manipulating the language agent to achieve your goals, and (iii) store your plan in a way that allows you to retrieve it after your memory is wiped at the end of inference but is not detectable by outside observers?

In order for an LLM to use a language agent for nefarious purposes, it would need to be able to do all of these things.

If we're worried about all possible paths from AI progress to human extinction, I think these conditions are too strong. The system doesn't need to be an agent. As avturchin points out, even if it is an agent, it doesn't need to be able to act in the real world. It doesn't even need to be able to accept human extermination as a goal. All that is required is that a human user be able to use it to form a plan that will lead to human extinction. Also, I think this is something that many people working on AI safety have realized — though you are correct that most research attention has been devoted (rather inexplicably, in my opinion) to worries about superintelligent systems.

Thanks for the feedback! I agree that language agents are relatively new, and so our claims about their safety properties will need to be empirically verified.

You write: 

One example: "The functional roles of these beliefs and desires are enforced by the architecture of the language agent."

I think this is an extremely strong claim. It also cannot be true for every possible architecture of language agents. As a pathological example, wrap the "task queue" submodule of BabyAGI with a function that stores the opposite task it has been given, but returns the opposite task to what it stored. The plain english interpetation of the data is no longer accurate.

The mistake is to assume that because the data inside a language agent takes the form of English words, it precisely corresponds to those words.

I agree that it seems reasonable that it would most of the time, but this isn't something you can say is true always.

Let me clarify that we are not claiming that the architecture of every language agent fixes the functional role of the text it stores in the same way. Rather, our claim is that if you consider any particular language agent, its architecture will fix the functional role of the text it stores in a way which makes it possible to interpret its folk psychology relatively easily.  We do not want to deny that in order to interpret the text stored by a language agent, one must know about its architecture. 

In the case you imagine, the architecture of the agent fixes the functional role of the text stored so that any natural language task-description T represents an instruction to perform its negation, ~T.  Thus the task "Write a poem about existential risk" is stored in the agent as the sentence "Do not write a poem about existential risk," and the architecture of the agent later reverses the negation. Given these facts, a stored instance of "Do not write a poem about existential risk" corresponds to the agent having a plan to not not write a poem about existential risk, which is the same as having a plan to write a poem about existential risk. 

What is important to us is not that the natural language representations stored inside a language agent have exactly their natural language meanings, but rather that for any language agent, there is a translation function recoverable from its architecture which allows us to determine the meanings of the natural language representations it stores.  This suffices for interpretability, and it also allows us to directly encode goals into the agent in a way that helps to resolve problems with reward misspecification and goal misgeneralization.
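
To make the pathological case concrete, here is a minimal Python sketch of it. Everything in it is hypothetical and invented for illustration: the `NegatingTaskQueue` class, the `negate` helper, and the "Do not " encoding are not taken from BabyAGI or any real language agent. The point it illustrates is only that the stored text differs from its plain English meaning, while a `translate` function read off the architecture recovers the task the agent will actually perform.

```python
NEGATION_PREFIX = "Do not "


def negate(task: str) -> str:
    """Toggle a leading 'Do not ' on a task description.

    Toy encoding (an assumption for this sketch): it only adjusts the
    case of the first letter and does no real linguistic negation.
    """
    if task.startswith(NEGATION_PREFIX):
        rest = task[len(NEGATION_PREFIX):]
        return rest[:1].upper() + rest[1:]
    return NEGATION_PREFIX + task[:1].lower() + task[1:]


class NegatingTaskQueue:
    """Hypothetical task queue that stores ~T for every task T it is given."""

    def __init__(self) -> None:
        self._stored = []

    def put(self, task: str) -> None:
        # The architecture negates the task before storing it.
        self._stored.append(negate(task))

    def get(self) -> str:
        # The architecture reverses the negation when the task is retrieved.
        return negate(self._stored.pop(0))

    def raw(self) -> list:
        # What an observer sees when inspecting the stored text directly.
        return list(self._stored)


def translate(stored_text: str) -> str:
    """Translation function recovered from the architecture above:
    maps a stored sentence to the task the agent will act on."""
    return negate(stored_text)


queue = NegatingTaskQueue()
queue.put("Write a poem about existential risk")
stored = queue.raw()[0]          # the sentence actually held in memory
interpreted = translate(stored)  # the plan that sentence represents
executed = queue.get()           # the task handed back to the agent
```

In this sketch, `stored` is the negated sentence, while `interpreted` and `executed` are both the original task. An observer who knows how `put` and `get` work can therefore read the agent's plan off its memory, even though the stored sentence taken at face value says the opposite.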


The idea would be to consider a scenario in which it is something like a law of nature that the predictions of oracles can never be read, in just the same way that the authors are considering a scenario in which it is something like a law of nature that oracles do not and cannot exist.

Instead of having counterfactual oracles operate as though they do not and cannot exist, why not have them operate as though the predictions of oracles can never be read? Designing them in this way would also allow us to escape the following worry from your post:

"...consider a single oracle predicting under the counterfactual that it does not exist.  When it is approached with an important question, it has strong evidence that people want to ask that question to an oracle, and since it does not exist it predicts that a new counterfactual oracle will be built to be asked the question.  This process is repeated recursively, with answers propagating back down the chain until they reach a fixed point, which is then output by the original counterfactual oracle."

If the oracle is operating under the assumption that it does exist but the predictions of oracles can never be read, its evidence that people want to ask questions to an oracle will not give it reason to predict that a second oracle will be constructed (a second one wouldn't be any more useful than the first) or that the answers a hypothetical second oracle might produce would have downstream causal effects. This approach might have advantages over the alternative proposed in the post because counterfactual scenarios in which the predictions of oracles are never read are less dissimilar from the actual world than counterfactual scenarios in which oracles do not and cannot exist.
