If there is an easy way of fixing it, sure, but I wouldn't devote more than a small amount of mental effort to thinking up solutions, unless you don't see anywhere else you can plausibly help with alignment. Again, it's just a weird worst-case scenario that could only happen through a combination of incredible incompetence and astronomically bad luck; it's not even close to the default failure scenario.
I get you have anxiety and distress about this idea, but I don't think hyper-focusing on it will help you. Nobody else is going to hyper-focus on it, they'll just focus on dealing with bugs in general, so working alongside them seems like it'd be the most effective thing to do, rather than work on a weird super-specific problem by yourself.
We'll just have to agree to disagree here then. I just don't find this particular worst-case-scenario likely enough to worry about. Any AGI with a robust way of dealing with bugs will deal with this, and any AGI without that will far more likely just break or paperclip us.
You seem to have a somewhat general argument against any solution that involves adding onto the utility function: "What if that added solution was bugged instead?". While maybe this can be resolved, I think it's better to move on from trying to directly target the sign-flip problem and instead deal with bugs/accidents in general. After all, the sign-flip is just the very unlikely worst-case version of that, and any solution for dealing with bugs/accidents in an AGI will also deal with it.
Another small silver lining is that we don't have to worry about making sure our alignment tools and processes generalize, just that they scale. So they can be as tailor-made to GPT as we want. I don't think this buys us much, as making an effective scalable tool in the first place seems like a much harder task than generalizing it.

Agreed, GPT is very alien under the hood even though it's mimicking us, and that poses some problems. I'm curious, however, just how good its mimicry of us is/is going to be, more specifically its moral mimicry. If it shares the same conceptual understanding of "the right thing to do" as we do, maybe there's some way we can leverage that, if it's good enough at it early on. I can't recall GPT being given unique moral dilemmas off the top of my head, but it'd be interesting to see whether it's currently on the right path.

Also, has anyone made a chart showing plausible GPT-level arrival dates yet? That seems like it would be very nice to have around to reference.
You've given me a lot to think about (and may have even lowered my confidence in some of my assertions). Kudos!
I do still have some thoughts to give in response, though they don't really function as very in-depth replies to your points, as I'm still in the process of ruminating:
I agree with you that GPT-3 probably hasn't memorized the prompts given in the OP; they're too rare for that to be worth it. I just think it's so big and has access to so much data that it really doesn't need to in order to solve prompts like that. Take the Navy Seal Copypasta prompts Gwern did as an illustration. Those were essentially GPT-3 belching out its connections in a very funny fashion, a lot of which were very weird/obscure. I just think people aren't truly appreciating the huge scope of this machine, and in a sense are underestimating it.
In some sense, I wish we could test GPT-3 the way we test animals for reasoning. Text is different from real life, though. If we put a totally new object in front of a crow, it could interact with it, learn how it works, and use it to solve a difficult task, whereas I don't feel we can necessarily do the same with GPT-3. The closest we get, in my opinion, are the prompts where a novel word is introduced, but we usually have to describe it enough for the prompt to work, and I feel like that defeats the discovery angle, which seems important.
Actually, now that I'm on this train of thought, a good prompt that could convince me there's some kind of reasoning going on could be built upon that. Perhaps a prompt where a fictional, very alien animal (by this I mean very unlike Earth animals, with very weird "abilities") is in conflict with another fictional, very alien animal, but we don't give GPT much information about the two animals and the overall scenario, and we somehow force it to interrogate us until it understands enough to submit an accurate story about how the conflict could play out. This test is interesting, but I don't know how viable it is: we would need to get very creative and design two alien animals in a very in-depth manner, as well as how they interact, the environment, the planet, etc. Perhaps a variant of this can be devised to reduce the difficult creative workload while still retaining the weird and alien (and thus less dataset interference) nature of it. I'm also not familiar with GPT in the role of an "interrogator", and I'm not sure if this can be done currently. It tends to be the one interrogated, not the other way around.
I think a reason why I'm insistent on the pattern-matching vs. reasoning distinction is cautiousness. Like you said, humans who are experienced enough can skip the reasoning part and just start pattern-matching in the domains they're familiar with. GPT-3 is very "experienced" in its own weird way, and its amazing pattern-matching abilities could let it get past obstacles we place while, unlike us, skipping the reasoning step. So if we get convinced an AI is reasoning when it's not, we could deploy it in circumstances where its lack of reasoning could cause problems and maybe even damage. I don't think this is much of a possibility with GPT-3, but it could be with future versions, so I'd prefer to be cautious.
I had some other vague thoughts, but I've been awake for like 20 hours, and my brain's weird-but-maybe-useful tangent about alien animals chased them all away. Apologies lol.
I recognize the points you are making, and I agree: I don't want to be a person who sets an unfeasibly high bar, but with how GPT-3 was developed, it's really difficult to place one that isn't near that height. If GPT-3 had instead been built mostly on algorithmic advances rather than scaling, I'd be a lot more comfortable placing said bar and a lot less skeptical, but it wasn't, and the sheer size of all this is in a sense intimidating.
The source of a lot of my skepticism is GPT-3's inherent inconsistency. It can range wildly from high-quality output to gibberish, repetition, regurgitation, etc. If it did have some reasoning process, I wouldn't expect such inconsistency. Even when it is performing so well that people call it "reasoning", it produces enough artifacts of its "non-reasoning" output to keep me skeptical (logical contradictions, its tendency to repeat itself, e.g. "Because Gravity Duh" like in the OP, etc.).
This is unfortunately just a very hard subject to get to the truth on. Illusions are meant to be convincing. Designing a test is difficult purely because the machine is, like I said, intimidating. It has so much knowledge inside its "head" that it's pretty much safe to assume it knows just about everything on the internet prior to its creation. How do we design a test around that? Even when we get weird and start asking about stuff like reversing gravity and clouds becoming a solid substance, the internet is huge and full of weird questions, and the model is big enough to have representations of just about everything, including the weird.
So the only tests I can think to design around that are wholly unsatisfying. Like, asking it to replicate using reasoning some fact about something we discover next week that we also discovered purely through reasoning. This is technically a test, in the same way garbage is technically edible.
The ideal answer is we develop transparency & interpretation tools that allow us to crack open the huge black-box that is GPT-3 and truly grok how it works. But I don't hold out hope for this, as the ML community is for some reason I can't understand relatively uninterested in these types of tools, which is sad and somewhat worrying.
So to cut this short, I think what I stated near the beginning about the inconsistencies of the model is the best bet. If that's fixed (if you ask, say, GPT-4 a question and it almost always correctly determines whether you want a joke or a serious answer; if a hypothetical news article you ask it to write lacks logical contradictions; if it doesn't carry over all the same errors as previous models; if it doesn't just regurgitate and repeat itself; etc.), I'd be a lot less skeptical about reasoning, as by that point it either has it, or its pattern matching has somehow scaled well enough to iron out all of those problems and can probably be just as good as reasoning.
These are my thoughts, rambling as they may be. I apologize if this doesn't fully answer your comment, as I said this whole thing is just difficult to deal with, which isn't unexpected since it's the peak of modern technology. I'm also astonishingly bad at putting my thoughts down into words. If GPT-3 had anything like thoughts, it'd probably be better at this than me lol.
In a very loosely similar sense (though not in an architecturally accurate sense) to how AlphaGo knows which moves are relevant for playing Go. I wouldn't say it was reasoning. It was just recognizing and predicting.
To give an example: if I were to ask various levels of GPT (perhaps just 2 and 3, as I'm not very familiar with the capabilities of the first version off the top of my head) "What color is a bloody apple?", it would have a list of facts in its "head" about the words "bloody" and "apple": one can be red or green; one is depicted as various shades of red, and in some circumstances brown and black; one falls from trees; one is British slang; etc. When the word "color" is added in, it does the same thing with that, and a primary color is red. Since all three share "red" among their listed facts, that is incredibly relevant, and most likely to be the correct answer.
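To make the "shared facts" intuition concrete, here's a deliberately silly toy sketch. This is in no way how GPT actually works internally; the association lists are invented purely to illustrate the intersection idea:

```python
# Toy illustration of the "shared facts" intuition, NOT GPT's actual mechanism.
# Each prompt word maps to a hypothetical, hand-written set of associated facts.
associations = {
    "bloody": {"red", "violent", "british slang"},
    "apple":  {"red", "green", "fruit", "falls from trees"},
    "color":  {"red", "green", "blue", "primary"},
}

def most_relevant_answer(words):
    """Return the facts shared by every word in the prompt."""
    return set.intersection(*(associations[w] for w in words))

print(most_relevant_answer(["bloody", "apple", "color"]))  # {'red'}
```

Since "red" is the only fact all three words share, the intersection singles it out, mirroring the hand-wavy story above about how the shared fact becomes the most likely answer.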
This is likely a poorly explained and inaccurate retelling of what GPT actually does, but essentially I argue it's doing something closer to that than reasoning. It's always been doing that, and now, with the third version, it does it well enough to give off the illusion of reasoning.
GPT-3 was trained on an astronomical amount of data from the internet, and asking weird hypotheticals is one of the internet's favorite pastimes. I would find it surprising if it was trained on no data resembling your prompts.
There's also the fact that its representations are staggeringly complex. It knows an utterly absurd number of facts "off the top of its head", including the mentioned facts about muzzle velocity, gravity, etc., and its recognition abilities are great enough to pick out which of the facts it knows are the relevant ones, based on the content of the prompt, to get to the right answer and output it.
That's roughly my view, although I'm not entirely satisfied with how I've explained it. Apologies. So yeah, none of the tests of reasoning so far have made me believe it wasn't just doing something like the above to get the right answer. I'd be surprised if it was doing something resembling reasoning, as I don't think large model + large dataset is enough to learn how to do that. You also need a task that is difficult enough, and then maybe reasoning will develop (like in real evolution), and I don't think text prediction fits that criterion.
Yeah, this sampling stuff brings up arguments about "curating" or "If you rephrase the same question and get a different answer then there is no reasoning/understanding here" which I'm sympathetic to.
I also think categorizing GPT-3's evasiveness, tendency to take serious prompts as joke prompts, etc., as solely the fault of the human is unfair. GPT-3 also shares the blame for failing to interpret the prompt correctly. This is a hard task, obviously, but that just means we have further to go, despite how impressive the machine already is.
I still haven't been convinced GPT-3 is capable of reasoning, but I'm also starting to wonder if that's even so important. Roughly, all GPT-3 does is examine text, try to find a pattern, and continue it. But it is so massive, and trained on so much data, that the patterns it can "see" and the connections it can make are far more expansive than we'd expect. What this means is that while it doesn't try to comprehend a logical question and then apply some kind of reasoning to answer it, its pattern-seeing ability, combined with its staggeringly huge amount of data and the connections it's made, allows it to, in a sense, "brute force" the answer anyway. This makes me believe pattern-matching can get us a lot farther than we previously thought it could, and do many things people previously held up as needing reasoning.
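As a microscopic caricature of "examine text, find a pattern, continue it", here's a bigram counter that continues text with the most frequent next word. Real GPT models are incomparably more sophisticated (learned representations, not lookup tables), and the corpus here is made up, but it illustrates the pattern-continuation framing:

```python
# Caricature of "find a pattern, continue it": a bigram frequency model.
# NOT an illustration of GPT's architecture, only of the framing above.
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    nexts = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        nexts[a][b] += 1
    return nexts

def continue_text(nexts, start, n=3):
    """Greedily extend `start` with the most common follower, n times."""
    out = [start]
    for _ in range(n):
        followers = nexts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran"
model = train_bigrams(corpus)
print(continue_text(model, "the"))  # "the cat sat on"
```

Scale the "corpus" up to most of the internet and replace the frequency table with a learned model expressive enough to see far subtler patterns, and you get the sense in which massive pattern-continuation can brute-force answers that look like reasoning.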
This is all just the opinion of a guy whose only qualification is reading a lot of other opinions about this thing, though.