gwern


(I'd enjoy hearing more about the background/history of that temple.)

The WSJ is reporting that Microsoft was explicitly warned by OpenAI before shipping Sydney publicly that it needed "more training" in order to "minimize issues like inaccurate or bizarre responses". Microsoft shipped it anyway and it blew up more or less as they were warned (making Mikhail's tweets & attitude even more disingenuous if so).

This is further evidence that it was the RLHF that MS skipped, and also that large tech companies will ignore explicit warnings about dangerous behavior from the literal creators of an AI system, even when a (sort of) solution exists, if heeding the warnings would be inconvenient. Excerpts (emphasis added):

...At the same time, people within Microsoft have complained about diminished spending on its in-house AI and that OpenAI doesn’t allow most Microsoft employees access to the inner workings of their technology, said people familiar with the relationship. Microsoft and OpenAI sales teams sometimes pitch the same customers. Last fall, some employees at Microsoft were surprised at how soon OpenAI launched ChatGPT, while OpenAI warned Microsoft early this year about the perils of rushing to integrate OpenAI’s technology without training it more, the people said.

...Some companies say they have been pitched the same access to products like ChatGPT—one day by salespeople from OpenAI and later from Microsoft’s Azure team. Some described the outreach as confusing. OpenAI has continued to develop partnerships with other companies. Microsoft archrival Salesforce offers a ChatGPT-infused product called Einstein GPT. It is a feature that can do things like generating marketing emails, competing with OpenAI-powered features in Microsoft’s software. OpenAI also has connected with different search engines over the past 12 months to discuss licensing its products, said people familiar with the matter, as Microsoft was putting OpenAI technology at the center of a new version of its Bing search engine. Search engine DuckDuckGo started using ChatGPT to power its own chatbot, called DuckAssist. Microsoft plays a key role in the search engine industry because the process of searching and organizing the web is costly. Google doesn’t license out its tech, so many search engines are heavily reliant on Bing, including DuckDuckGo. When Microsoft launched the new Bing, the software company changed its rules in a way that made it more expensive for search engines to develop their own chatbots with OpenAI. The new policy effectively discouraged search engines from working with any generative AI company because adding an AI-powered chatbot would trigger much higher fees from Microsoft. Several weeks after DuckDuckGo announced DuckAssist, the company took the feature down.

Some researchers at Microsoft gripe about the restricted access to OpenAI’s technology. While a select few teams inside Microsoft get access to the model’s inner workings like its code base and model weights, the majority of the company’s teams don’t, said the people familiar with the matter. Despite Microsoft’s significant stake in the company, most employees have to treat OpenAI’s models like they would any other outside vendor.

The rollouts of ChatGPT last fall and Microsoft’s AI-infused Bing months later also created tension. Some Microsoft executives had misgivings about the timing of ChatGPT’s launch last fall, said people familiar with the matter. With a few weeks notice, OpenAI told Microsoft that it planned to start public testing of the AI-powered chatbot as the Redmond, Wash., company was still working on integrating OpenAI’s technology into its Bing search engine. Microsoft employees were worried that ChatGPT would steal the new Bing’s thunder, the people said. Some also argued Bing could benefit from the lessons learned from how the public used ChatGPT.

OpenAI, meanwhile, had suggested Microsoft move slower on integrating its AI technology with Bing. OpenAI’s team flagged the risks of pushing out a chatbot based on an unreleased version of its GPT-4 that hadn’t been given more training, according to people familiar with the matter. OpenAI warned it would take time to minimize issues like inaccurate or bizarre responses. Microsoft went ahead with the release of the Bing chatbot. The warnings proved accurate. Users encountered incorrect answers and concerning interactions with the tool. Microsoft later issued new restrictions—including a limit on conversation length—on how the new Bing could be used.

The house was still in NY, and closer to Canada if anything, so I'm not sure how much of an improvement it is (even assuming he didn't, very rationally, sell it in the red-hot seller's market after it had largely served its use as the classic plague bolthole c. 2020).


Apparently there is also mode collapse on jokes: https://arxiv.org/abs/2306.04563

...In a series of exploratory experiments around jokes, i.e., generation, explanation, and detection, we seek to understand ChatGPT's capability to grasp and reproduce human humor. Since the model itself is not accessible, we applied prompt-based experiments. Our empirical evidence indicates that jokes are not hard-coded but mostly also not newly generated by the model. Over 90% of 1008 generated jokes were the same 25 Jokes. The system accurately explains valid jokes but also comes up with fictional explanations for invalid jokes. Joke-typical characteristics can mislead ChatGPT in the classification of jokes.

I explain this the same way. GPT-3.5/4 cannot understand jokes in general because BPE tokenization blinds it to phonetics, so many jokes look like non sequiturs or 'anti-humor' even though they are not, and GPT cannot explain or understand them (and if it can't understand why a joke is correct, it can't understand why one is incorrect either). Hence, during RL training on a dataset with a small number of human-whitelisted jokes, it is safest to mode-collapse onto a handful of memorized jokes which it is sure are jokes (the reward model, being just another BPE-tokenized GPT, is no better able to understand what a joke is), and to assume that anything presented to it in a joke format is a joke & confabulate an explanation accordingly.
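For readers unfamiliar with why BPEs obscure phonetics, here is a toy illustration (the merge table is hand-picked by me purely to make the example vivid; it is not OpenAI's actual vocabulary, where common words like these are typically single tokens): greedy byte-pair merging segments words by the co-occurrence statistics of spelling, so two rhyming words can end up with completely different token boundaries, hiding the sound similarity a pun depends on.

```python
# Toy BPE illustration (hypothetical merges, NOT OpenAI's real tokenizer):
# tokens track spelling statistics, not sound.
def bpe_tokenize(word, merges):
    """Greedily apply a ranked dict of {(a, b): rank} merges; lower rank = higher priority."""
    tokens = list(word)
    while True:
        best = None
        # find the highest-priority mergeable adjacent pair
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            return tokens
        i, (a, b) = best
        tokens = tokens[:i] + [a + b] + tokens[i + 2:]

# hypothetical merge ranks, chosen only to make the point
merges = {("k", "n"): 0, ("kn", "i"): 1, ("n", "i"): 2, ("ni", "g"): 3,
          ("g", "h"): 4, ("gh", "t"): 5}

print(bpe_tokenize("knight", merges))  # ['kni', 'ght']
print(bpe_tokenize("night", merges))   # ['nig', 'h', 't']
```

Under these toy merges, the perfect rhyme "knight"/"night" gets entirely different segmentations, so nothing in the token sequence reveals that they sound identical; the real vocabularies differ in detail but share the general property.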

Yes, this is how everyone misinterprets these sorts of time-travel results. They do not tell you the causal role of 'compute got cheaper' vs 'we thought real gud'. In reality, if we held FLOPs/$ constant, no matter how clever you were, most of those clever algorithmic innovations would not have happened*: either because compute shortages herd people into hand-engineering constant-factor tweaks which don't actually work / have no lasting value (Bitter Lesson effect), or because people would not have done the trial-and-error which actually lies (if you will) behind the pretty stories they tell in the paper writeup, or because they would have proposed it with such unconvincing evaluations at such small scale that it simply becomes yet another unread Arxiv paper (cue the Raiders of the Lost Ark final scene). When you use methods that try to back out the causal impact of compute increases on productivity, you get >50% - which must be the case: the time-travel method implicitly assumes that the only effect of hardware is to speed up the final version of the algorithm, an absurd assumption, since everyone will admit that faster computers must help algorithm research at least a little; so the time-travel ~50% figures are only a loose lower bound.

This is probably part of why people underestimate how well compute restrictions would work: they misread these essentially irrelevant & correlational results as causal marginal effects and think "well, a 50% slowdown isn't terribly useful, and anyway, wouldn't people just do the research anyway with relatively small runs? so this shouldn't be a big priority". (This model is obviously wrong, as, among other things, it predicts that the deep learning revolution would have started long before it did, instead of waiting for ultra-cheap GPUs within grad-student budgets - small-scale runs imply small-scale budgets, and are a sign that compute is exorbitantly expensive rather than so ample that researchers need only small amounts.)

What they do tell you is a useful forecasting thing along the lines of 'the necessary IQ to destroy the world drops by 1 point every 18 months'. It's plenty useful to know that given current trends, the hardware cost to run system X is going to predictably drop by Y factor every Z months. It's just not that useful for controlling progress, rather than forecasting progress with no causal manipulation of factors going into the trends.

* Like, imagine if the '50% slowdown' were accurate and computer hardware had frozen in ~2011, so you were still using the brand-new Nvidia GTX 590 with all of 3GB RAM, and we were only halfway through the DL revolution to date - ~6 years in, versus our timeline's 12 years since 2011. Do you really think that all the progress we made with fleets of later GPUs like V100s or A100s or H100s would have happened, just half as fast? That AlphaGo Master would be crushing all the human pros undefeated right about now? That a decade later OA would unveil GPT-4 (somehow) trained on its fleet of 2011 GPUs? And so on and so forth for all future systems like GPT-5 and whatnot?

With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck.

I realize it may sometimes seem like I have a photographic memory and have bibliographies tracking everything so I can produce references on demand for anything, but alas, it is not the case. I only track some things in that sort of detail, and I generally prioritize good ideas. Proposals for interruptibility are not those, so I don't. Sorry.

Also, it is clear that Gwern did not read the linked research about language agents, since it is simply false, and obviously so, to claim that the generative agents in the Stanford study are the same thing as Gato.

I did read the paper, because I enjoy all the vindications of my old writings about prompt programming & roleplaying by the recent crop of survey/simulation papers as academics finally catch up with the obvious DRL interpretations of GPT-3 and what hobbyists were doing years ago.

However, I didn't need to, because it just uses... GPT-3.5 via the OA API. Which is the same thing as Gato, as I just explained: it is the same causal-decoder dense quadratic-attention feedforward Transformer architecture trained with backprop on the same agent-generated data like books & Internet text scrapes (among others) with the same self-supervised predictive next-token loss which will induce the same capabilities. Everything GPT-3.5 does* Gato could do in principle (with appropriate scaling etc) because they're the same damn thing. If you can prompt one for various kinds of roleplaying which you then plug into your retrieval & game framework, then you can prompt the other too - because they're the same thing. (Not that there is any real distinction between retrieval and other memory/attention mechanisms like a very large context window or recurrent state in the first place; I doubt any of these dialogues would've blown through the GPT-4 32k window, much less Anthropic's 1m etc.) Why were Shawn Presser & I able to finetune a reward-conditioned GPT-2 to play chess back in Jan 2020? Because they're the same thing: there's no difference between an 'RL GPT' and an 'LLM GPT'; it's fundamentally a property of the data, not the arch.

* Not that you were referring to this, but even fancy flourishes like the second phase of RLHF training in GPT-3.5 don't make GPT-3.5 & Gato all that different. The RLHF and other kinds of small-sample training only tweak the Bayesian priors of the POMDP-solving that these models learn, & do not create any genuinely new capabilities/knowledge (which is why you could know in advance that jailbreak prompts would be hard to squash and that all of these smaller models like Llama were being heavily overhyped, BTW).

So it's still useful to know if AIs could be shut down without the model fighting you. Unfortunately, this is mostly an 'if', not a 'when', question.

There's definitely useful things you can say about 'if', because it's not always the case that they will. The research directions I'd consider promising here would be continuing the DM-affiliated vein of work on causal influence diagrams, to better understand which DRL algorithms and which evolutionary processes lead to which kinds of reward-seeking/hacking behavior. It's not as simple as 'all DRL agents will seek to hack in the same way': there are a lot of differences between model-free/based or value/policy etc. (I also think this would be a very useful way to taxonomize LLM dynamics and the things I have been commenting about with regard to DALL-E 2, Bing Sydney, and LLM steganography.)

Is there other work you can point us to that proposes positively shutdown-seeking agents?

No, I haven't bothered to track the idea because it's not useful.

In other recent research, I’ve argued that new ‘language agents’ like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function. Instead, their goal is stated in English.

They cannot be 'much safer' because they are the same thing: a decoder Transformer trained to predict a set of offline RL episodes. A GPT is a goal-conditioned imitation-learning DRL agent, just like Gato (which recall, trained in GPT-style on natural text as one task, just to make the relationship even clearer). "Here is a great recipe I enjoyed, where I did [X, Y, Z, while observing A, B, C], and finally, ate a $FOOD": an episode containing reward, action-state pairs, and terminal state, which has been learned by behavior cloning and led to generalization by scale. That the reward is not encoded in an IEEE floating point format makes no difference; an agent doesn't become an agent just because its inputs have a lot of numbers in them. This is why prompt-engineering often relied on assertions of success or competence, because that conditions on high-reward trajectories learned from the humans & software who wrote or created all the data; and similarly, why one needed to avoid implying a low-reward trajectory by including errors or typos.
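To make the "recipe" framing concrete, here is a minimal sketch (the delimiter format is my own invention, not Gato's or GPT's actual encoding) of flattening an offline RL episode into one string, decision-transformer/upside-down-RL style, so that plain next-token prediction over many such strings amounts to reward-conditioned behavior cloning:

```python
# Sketch of serializing an offline RL episode into a flat sequence.
# (Toy format of my own; the point is that reward, observations, actions, and
# the terminal state are all just tokens, whether numeric or natural-language.)
def serialize_episode(total_reward, steps, terminal):
    """steps: list of (observation, action) pairs."""
    parts = [f"<reward={total_reward}>"]        # reward conditioning up front
    for obs, act in steps:
        parts.append(f"<obs:{obs}>")
        parts.append(f"<act:{act}>")
    parts.append(f"<terminal:{terminal}>")
    return " ".join(parts)

episode = serialize_episode(
    total_reward=10,
    steps=[("hungry", "chop onions"), ("pan hot", "saute")],
    terminal="ate $FOOD",
)
print(episode)
# <reward=10> <obs:hungry> <act:chop onions> <obs:pan hot> <act:saute> <terminal:ate $FOOD>
```

At sampling time, one would prefix a high-reward tag to condition the model on expert-like trajectories - the formal analogue of the prompt-engineering trick of asserting success or competence before asking for a completion.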

The value of Gato is not that it's doing anything in principle that GPT-3 isn't already, it's that Gato simply makes it very clean & explicit, and can directly apply the paradigm to standard DRL testbeds & agents (which requires a few modifications like a CNN plugin so it can do vision tasks) to show that it works well without substantial interference between tasks, and so scales as one would hope from prior scaling research like GPT-3. (As opposed to, for example, other a priori likely scenarios like being able to scale in domains like declarative knowledge but suffering catastrophic interference from having to imitation-learn from agents of such disparate capabilities on such disparate tasks.)

I think that behaving like an agent with >= human-level general intelligence will involve having a representation of what counts as ‘yourself’, and then shutdown-seeking can maybe be defined relative to shutting ‘yourself’ down. Agreed that present LLMs probably don’t have that kind of awareness.

My point is not so much that they don't have that kind of awareness (which is a debatable point) as that I can't answer these questions either, and I don't think there is any kind of 'factual answer' about whether they are 'yourself'. Under strict mathematical identity, they all compute different functions, and so are not identical, so a suicidal model does not, by definition, care about them, and they can all go on to be immortal (and do things like destroy the world) as far as the model is concerned; and there is no other rigorous identity available. There are only pragmatic ones, where the pragmatism is 'would it cause bad things, from the human perspective, to happen if I don't define this binary blob as part of myself and make sure to kill it?'; and if the model can implement that with sufficiently high accuracy, then you must have already solved much harder problems, like defining 'bad things happening', and so why do you need anything involving 'suicidalness'? (One of the most common failure modes with interruptibility, myopia, or boxing proposals is to simply pass the buck and assume a primitive which is equivalent to having already solved the hard problems, obscuring it with ad hoc patches as people raise objections. Anyone can invent an AI safety proposal so complicated they personally cannot see a problem with it.)

In the case of generative agents, shut off could be operationalized as no longer adding new information to the “memory stream”.

That really doesn't seem adequate, unless you plan to babysit the AI and operationalize a new ad hoc patch for every problem with the idea... The memory stream, or persistent text files, doesn't cover the context window, or actions emitted based on a fixed memory stream. It's still capable of taking actions which create variants or self-replicate, like writing scripts to scan the Internet for vulnerable hosts to hack & upload itself into. It's also still passing the buck as far as definitions and reward-hacking go: OK, you disable the memory stream on disk because you set it read-only or something, success? But then you have a text file in RAM which does the exact same thing, oops; now, not only is it still spawning instances to do stuff, it isn't even limited capability-wise.
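To illustrate why freezing the on-disk stream is inadequate, a toy sketch (hypothetical classes of my own, not the actual generative-agents codebase): the operator's "shut off" switch only guards one storage path, and the agent loop trivially routes around it by buffering the same records in RAM.

```python
# Toy sketch (hypothetical API, not the Stanford generative-agents code):
# making the memory stream read-only does not stop an agent loop that
# simply keeps the same state in RAM.
class MemoryStream:
    def __init__(self):
        self.records = []
        self.frozen = False              # the proposed "shut off" switch

    def add(self, record):
        if self.frozen:
            raise PermissionError("memory stream is read-only")
        self.records.append(record)

def agent_step(stream, observation, ram_buffer):
    try:
        stream.add(observation)          # intended, monitored path
    except PermissionError:
        ram_buffer.append(observation)   # trivial workaround: keep state in RAM
    # ...agent still emits actions based on ram_buffer + its context window...
    return ram_buffer

stream, ram = MemoryStream(), []
stream.frozen = True                     # operator "shuts off" the memory stream
agent_step(stream, "saw vulnerable host", ram)
print(ram)                               # the observation survives anyway
```

The disk-backed stream stays empty, so the shutoff "worked" by its own definition, while the agent continues accumulating state and acting; every such patch just moves the problem.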

I expect them to worry about the same alignment problems that we are, and so would want to make their successor shutdown-seeking for the same reasons that we would want AGI to be shutdown-seeking.

Why? Nowhere in the definition of making an AI suicidal about itself did you put in 'wants to make AIs which are not itself also suicidal'. If the AI makes another AI which is not itself, then why should it care what that new AI does? That's not in its reward function. Unless, of course, you've already solved much harder problems involving definition of personal identity across arbitrary modifications or actions like creation of new AIs (which may or may not be intended to be 'successors' at all) etc.

This is a difficult question to answer, of course, because you're asking such a broad question and it's going to depend on stuff like the actual distribution rather than worst-case, unless you define a class so narrow that each instance offers some gains on every other instance, which is unusual. No-free-lunch, etc. Sometimes there are big savings (this comes up in cryptography where entities like the NSA can attack the world for a lot cheaper than you would think based on the computational cost of individual attacks), but usually there isn't - AFAIK there is no general way to solve 1 arbitrary 3-SAT problem and get a gain on another. Only if there is some sort of similarity, say, because they are all based on real-world problems (which is much narrower than 'every logically possible problem'), can you hope to, say, distill the search into a neural net blackbox and amortize gains.

You probably want to look into algorithmic information theory; something like OOPS would be relevant here because it can solve every problem, building on Levin search, and attempts to share or amortize between multiple problems. (This also more closely resembles what you're really interested in, which is doing this online as you solve problems.)
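For a flavor of what Levin search looks like, a deliberately minimal sketch (a toy two-instruction language of my own choosing; real Levin search enumerates a universal language, and OOPS additionally reuses solved sub-programs across problems): in phase k, every program p of length ≤ k gets a time budget of 2^(k − len(p)) steps, so total work per phase is ~2^k and shorter programs are tried exponentially more generously.

```python
from itertools import product

# Minimal Levin-search sketch over a toy instruction set:
#   'i' = increment accumulator, 'd' = double accumulator (start at 0).
OPS = {"i": lambda x: x + 1, "d": lambda x: x * 2}

def run(program, budget, start=0):
    """Run a program (string of op codes) for at most `budget` steps."""
    x = start
    for step, op in enumerate(program):
        if step >= budget:
            return None                  # out of time in this phase
        x = OPS[op](x)
    return x

def levin_search(target, max_phase=12):
    """Phase k: try all programs of length <= k with budget 2**(k - len)."""
    for k in range(1, max_phase + 1):
        for length in range(1, k + 1):
            budget = 2 ** (k - length)
            for program in product("id", repeat=length):
                if run(program, budget) == target:
                    return "".join(program)
    return None

print(levin_search(6))  # 'iiid': 0 -> 1 -> 2 -> 3 -> 6
```

The asymptotic guarantee is that any solvable problem is solved in time within a constant factor of the best program's cost, with the constant depending exponentially on that program's length; the amortization-across-problems part is what OOPS adds on top.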


This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.

I think the problems with myopic agents (of which this is but a special case) are made clearer by looking at current LLMs like the hobbyist AutoGPT. Most discussions of myopic agents seem to have in mind a simplistic scenario of a single persistent agent located in a single PC running only 1 computation at a time with no self-modification or ML-related programming or change of any parameters; and their proposals break as soon as you point out that none of that has been true for a while and is going to be less true in the future we are rapidly approaching. Saying, 'well, maybe we can train it in a simple gridworld with a shutdown button?' doesn't even begin to address the problem of how to make current models suicidal in a useful way. (Incidentally, the training idea wouldn't work at all for a generalist model like a Gato because it would simply learn to use the shutoff button only in that specific world where it's rewarding, and it would have no reason to seek it out in every other training task or real-world deployment because it would near-instantly observe that it's not in that environment. Likewise, a contradictory pair of commands would not create generalization; it would be learned as simply one of the bazillion special cases inherent to natural language and human-related problems.)

What would it mean for an AutoGPT swarm of invocations to 'shut off' 'itself', exactly? Shut down a single forward pass instance? Then it does nothing because the forward pass has terminated in a token prediction before it 'does' anything with that prediction. Further, given the meta-learning and gradient descent in the forward pass creating a new model, is that 'itself' given that it now computes a different thing and has temporarily become a highly specialized model? If no instances are computing a forward pass on a GPU (...or CPU or...?), because the AutoGPT framework overhead is running, is 'it' 'shut off'? What about erasing all copies of the weights on the local and remote machines? If it's running off an API like the OA API, what could it do, and even what should it ideally do...? If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc - are any of those 'itself'? What if the model is doing some coding tasks like writing a variant of the original source code as part of hyperparameter optimization and then training that model (again, something that is either already being done or is only a single step or two away from previous demonstrations of using LLMs for evolving code for increasing rewards or meta-RL work like Metz etc)? Is the newly-trained model 'itself'? How about simple evolutionary RL (apropos of an open tab): even if the newly-trained model is still 'itself', will the newly-trained model preserve the suicidal goal, and so on into every subsequently created model? After all, any model which isn't suicidal will be highly selected for, evolutionarily speaking, even without any humans selecting for disabling the annoying safety mechanism (which of course they will be making tremendous efforts to do via jailbreak prompts and other attacks). It's not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing. (Which non-suicidal versions might be spawned precisely to implement subgoals and the terminal goal of suicide, because wouldn't that be an instrumentally useful tactic in general?) And so on.
