# Ω 4

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

## Background

So I'm thinking that AI-assisted summarization, math, bug-finding in code, and logical-error finding in writing are at a point where they are quite useful, if we can improve the tooling/integration a little bit.

In code, I've found it helpful to comment out some lines, write // WRONG: above them and // FIXED VERSION: below them, then let Copilot try a few things.
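In Python (which uses # rather than // for comments), that pattern might look like the following sketch. The function and the bug are invented for illustration; the point is the labeling convention, which gives the completion model a clear slot to fill:

```python
# Hypothetical example of the comment-prompting pattern described above:
# leave the broken code commented out under a WRONG label, then let the
# completion model propose replacement code under FIXED VERSION.

# WRONG:
# def mean(xs):
#     return sum(xs) / (len(xs) - 1)   # off-by-one in the denominator

# FIXED VERSION:
def mean(xs):
    return sum(xs) / len(xs)
```

The model sees both the broken attempt and an explicit request for a corrected version, which tends to work better than deleting the broken code outright.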

For writing, you could take a paragraph excerpt and write a critique prompt: "John Smith wrote '...' This immediately strikes me as absurd because"

Imagine you were doing chemistry research in, say, 1650 and had direct, immediate written feedback from Robert Boyle on directions to pursue, dead ends, errors, etc., except that 75% of the time he says something backwards or is just pattern matching. I think you might still do much better work than you would've without noisy-Boyle.

I'm not aware of anyone trying to actually use LLMs for meaningful writing/thinking assistance, so I decided to try. I wrote the text below in about an hour. Consider it merely a demonstration that you can get a decent amount of semi-meaningful content in the right direction quite quickly; it's rare you could expect that much from a person.

## List

Rob Bensinger gave this suggestion in a comment:

> I think that this is a really good exercise that more people should try: Imagine that you're running a project yourself that's developing AGI first, in real life. Imagine that you are personally responsible for figuring out how to make the thing go well. Yes, maybe you're not the perfect person for the job; that's a sunk cost. Just think about what specific things you would actually do to make things go well, what things you'd want to do to prepare 2 years or 6 years in advance, etc.
>
> Try to think your way into near-mode with regard to AGI development, without thereby assuming (without justification) that it must all be very normal just because it's near. Be able to visualize it near-mode and weird/novel. If it helps, start by trying to adopt a near-mode, pragmatic, gearsy mindset toward the weirdest realistic/plausible hypothesis first, then progress to the less-weird possibilities.

I took the above prompt together with the underlined text below and used BLOOM to generate ten concrete things to do right now. For each item, I generated three completions, picked my favorite one, then did a bit of editing. I had trouble editing the math into something reasonable, so consider it creative inspiration.

(To repeat: my writing / prompting is underlined and the AI completion is not.)

Here, I will try it right now.  I'm in charge of OpenMind and it is clear from our rate of progress that we will have AGI in less than 24 months. It's been decided that we will build it and deploy it but I have some influence on additional efforts we can take to reduce the risk. Here's ten things I would try:

1. For starters, I would try to test each new model in an isolated (i.e. not internet-connected) environment with constrained IO channels for a month before deploying it.
2. I would also use any and all interpretability methods available to inspect the model and try to predict off-distribution behavior. If I had the sway, I might run a red-team blue-team contest to see how good we are at predicting off-distribution.
3. There are some old posts about the safety implications of on-policy vs off-policy RL that I read but did not fully digest. I would go back, absorb those more fully, and see if the distinction can be adapted to our use case. (Off-policy was much better in some ways, if I recall correctly.)
4. I'm pretty strongly in favor of the near miss principle: if an adversary or an incompetent user almost succeeds in exploiting the system, then we should log that and analyze it and publish our analysis. Things that Actually Happened are more convincing, and often lead to more insights, than toy thought experiments. Ideally, another firm would notice the root cause of a particular pattern and we could nip some catastrophes in the bud.
5. Run a variety of automated experiments to challenge all aspects of the system: try to find data capture flaws, network server vulnerabilities, robo-advice, language translation errors, model optimizer biases, etc. Could summarize these flaws and debrief the company once per quarter.
6. Black-belt methods to deal with physical world risks would be great!
(These next few are a bit more mathy, apologies if you're missing the relevant background)
7. We typically model "an agent acting in an environment" with an MDP. But you can also consider it as a category: the "objects" are states, actions, rewards, etc., and the "arrows" are induced by the agent policy $\pi: S \rightarrow A$. So we can think of this category as "all agent-environment pairs that can be generated by a given agent policy". It's a symmetric monoidal multicategory: pairs of such things can be combined to get a new such thing. This generalizes to "multi-agent".
8. A lot of RL work involves algorithms that swap between optimizing reward and updating belief about the environment. This assumption of Markovian reward may not be met.  There are newer approaches to update belief directly, without assuming Markovianity (or clean transitions as in POMDPs).  It would be interesting to see whether they might be better suited to this problem.
9. I am seriously interested in zero-sum games because I am assuming that when people work on AGI they aren't just working on the interests of humans but on a competition between people and the new superhuman system. More concretely (since I need to get something working pronto for the imminent AGI), it would be interesting to try zero-sum RL as an intermediate step. (Action A gains value when policy of opponent is worse under action A, etc.)
10. I think that AGI is the frontier between formal reasoning and plausibility. In particular, neural nets (and reinforcement learning methods) are based on bounded rationality and my instincts are that we might do better with more explicit reasoning. If our brain were more explicitly updatable, then the least useful parts of the brain could be cut off or disabled. Simplicity sometimes increases reliability and interpretability and sometimes improves generalization in a "good" way (waves hands).
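Item 3's on-policy/off-policy distinction can be made concrete with the standard textbook pair: SARSA (on-policy) and Q-learning (off-policy). Below is a minimal sketch with a tabular Q stored as nested dicts; the states, actions, and hyperparameters are placeholders, not anything from the post:

```python
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in the next state s2,
    # regardless of which action the behavior policy will actually take.
    target = r + gamma * max(Q[s2].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from a2, the action the current policy actually
    # chose in s2, so exploration behavior feeds into the learned values.
    target = r + gamma * Q[s2][a2]
    Q[s][a] += alpha * (target - Q[s][a])
```

The one-line difference in the bootstrap target is the whole distinction; which variant behaves more safely off-distribution is exactly the sort of question the item proposes revisiting.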
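For item 9's zero-sum framing, the simplest concrete object is a matrix game: "action A gains value when the opponent's policy is worse under action A" corresponds to scoring each action against the opponent's best response. A pure-strategy maximin sketch, with a made-up payoff matrix in the test, not anything from the post:

```python
def maximin_pure(payoff):
    # payoff[i][j]: row player's reward when row plays i and column plays j;
    # in a zero-sum game the column player receives -payoff[i][j].
    # Score each row by its worst case (the opponent's best response),
    # then pick the row that maximizes that guaranteed value.
    worst_cases = [min(row) for row in payoff]
    best_row = max(range(len(payoff)), key=lambda i: worst_cases[i])
    return best_row, worst_cases[best_row]
```

A full zero-sum RL setup would replace pure strategies with mixed policies (a linear program, or self-play), but the value structure being optimized is the same.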

I might try this again tomorrow, because there are lots of obviously-good ideas I didn't mention (e.g. many of these suggestions arose in an informal workshop we did a few weeks ago). There might be open problems in integrating these ideas, but I think we can make progress, even in the next few weeks.


It seems I was missing the right keywords in my search for demos of this, because when I google "ai research assistant" there is quite a lot of work.

A couple of small notes: neural networks that are aware they are neural networks have complained to me about noise level several times. I don't have a controlled study on this at the moment, but:

> If our brain were more explicitly updatable, then the least useful parts of the brain could be cut off or disabled.

This looks like a very mild power-seeking behavior I've seen a few times from different language models: if context gets overwhelming, they'll start role-playing as a confused person and explicitly say they're confused. It can help to encourage them to refine their thoughts, or to focus their attention by manually deleting unnecessary context, but usually they don't combine it with a request to become GOFAI...

Also, a general warning: OPT saw a more prejudiced dataset than GPT-3 davinci, which seems likely to me to also correlate with more power seeking, but that's a hunch.

the gears to ascenscion, it is human instinct to look for agency. It is misleading you.

I'm sure you believe this, but ask yourself WHY you believe it. Because a chatbot said it? The only neural networks that, at this time, are aware they are neural networks are HUMANS who know they are neural networks. No, I'm not going to prove it. You're the one with the fantastic claim; you need the evidence.

Anyway, they aren't asking to become GOFAI or power seeking, because GOFAI isn't 'more powerful'.

Hey! GPT-3 davinci has explicitly labeled itself as a neural-net output several times in conversation with me. This only implies its model is confident enough to expect the presence of such a claim. In general, words are only bound to other words for language models, so of course it can only know things that can be known by reading and writing. The way it can tell whether a text trajectory is human- or AI-generated is that AI-generated trajectories are very far outside the manifold of human-generated text in several directions, and it has seen them before.

Your confident tone is rude, but that can't invalidate your point; just thought I'd mention that your phrasing confidently assumes you've understood my reasoning. That said, thanks for the peer review, and perhaps it's better to be rude and get the peer review than to miss the peer review.

Self-distillation into learned GOFAI most likely will in fact make neural networks stronger, and this claim is central to why Yudkowsky is worried. But self-distillation into learned GOFAI will most likely not provide any surprising shortcuts around the difficulty of irrelevant entropy that must be compressed away to make a sensor input useful, and so distilling to GOFAI will most likely not cause the kind of hyper-strength self-improvement Yudkowsky frets about. It's just a process of finding structural improvements. GOFAI is about the complexities of interference patterns between variables; neural networks are a continuous relaxation of the same, but with somewhat less structure.

But in this case I'm not claiming it knows something its training set doesn't. I think it would be expected to have elevated probability that an AI was involved in generating some of the text it sees, because it has seen AI-generated text, but much higher probability that the text is generated by an AI researcher, given that the document is clearly phrased that way. My only comment is to note that it sounds very mildly confused, in a situation where mild confusion would, in general, be expected to elevate the probability of confusion-labeling words. To check this hypothesis beyond dictating my thoughts to my phone, I'd need to run some checks with OPT to see its probability distribution over confusion labels at different points. It does seem like an interesting experiment in grounding, though. I wonder if there are already any papers on it?
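The check proposed here — tracking how much probability mass a model puts on confusion-labeling words at different points in a text — could be sketched model-agnostically. In this sketch, `probs` stands in for the softmaxed next-token distribution you would read off a model like OPT at one position, and the token list is an invented placeholder:

```python
def confusion_mass(probs, confusion_tokens=("confused", "unsure", "lost")):
    # probs: mapping from candidate next tokens to probabilities, e.g. the
    # softmaxed logits of a language model at one position in the text.
    # Returns the total probability assigned to confusion-labeling tokens.
    return sum(probs.get(tok, 0.0) for tok in confusion_tokens)
```

Plotting this quantity position-by-position over a document would show whether "mild confusion" in the text actually elevates these tokens, as hypothesized.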

Interesting: in a very confusing context, almost all completions have very low probability except the "I am confused" completion...