I don't have much direct experience with transformers (I was part of some research with BERT once where we found it was really hard to use without adding hard-coded rules on top, but I have no experience with the modern GPT stuff). However, what you are saying makes a lot of sense to me based on my experience with CNNs and the attempts I've seen to explain/justify CNN behaviour with side channels (for instance this medical image classification system that also generates text as a side output). See also my comment on Facebook.
I think what you're saying makes a lot of sense. When assembling a good training data set, it's all about diversity.
Sorry, I missed that somehow. Thanks.
(cross-posting this comment from E. S. Yudkowsky's Facebook with some edits / elaboration.) Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.
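If anyone does run that experiment, a cheap way to extrapolate from it is to record eval loss at each dataset size and fit a power law, then invert it to estimate the dataset size needed to hit a target loss. A minimal sketch (the function names and the pure power-law form loss ≈ A·N^(−b) are my own assumptions, not anything established in this thread):

```python
import math

def fit_power_law(sizes, losses):
    # Fit loss ~ A * N^(-b) by least squares in log-log space,
    # where N is dataset size and loss is the eval loss observed at that size.
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    b = -slope                      # exponent of the power law
    A = math.exp(my + b * mx)       # prefactor: log A = my - slope * mx
    return A, b

def size_needed(A, b, target_loss):
    # Invert loss = A * N^(-b) to estimate the N that reaches target_loss.
    return (A / target_loss) ** (1.0 / b)
```

This is only a back-of-the-envelope extrapolation: real fine-tuning curves often flatten out (an irreducible loss term), so a fit like this gives at best a lower bound on the required dataset size.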
Note this is similar to the "self-explaining AI" idea I explored in early 2020, which I threw together a paper on (I'm hesitant to link to it because it's not a great paper and much of the discussion there is CNN-specific, but here it is). I can see how producing "thoughts" could help us trust a model, or determine how much it really understands what's going on or how to tell a good story.
However, I could also see the "thoughts" output misleading people - they might mistake the model's explanations for the actual calculations going on inside the model to produce an output. The way GPT-3 works is, I suspect, very far from how humans think. GPT-3 is very bad at a lot of common-sense and physics-based reasoning, for instance, yet based on the thoughts output a user might be misled into thinking the model understands common-sense notions or physics, since it's spouting off a version of some stuff it got from its training data. Any work along these lines would definitely need empirical testing / studies to show that the extra "thoughts" output is useful to end users in some way (like predicting failure modes or helping debug failures). Also, I'm unclear on what constitutes a "run"... roughly how long does the text have to be, in words, to have a chance at getting $20,000?
Note: Pfizer started a trial in September to try to answer this question, so we may know the answer in a few months. In theory I don't see why it wouldn't work, but with limited supply there are probably better uses, at least in the next few months. Also, note the initial EUA application asks that it be approved for high-risk patients only, probably because Pfizer was told by the FDA it wouldn't be EUA'd otherwise. Paxlovid must be taken with ritonavir (otherwise Paxlovid breaks down too fast), which messes with liver enzymes and isn't a good choice for many patients (potential for many drug-drug interactions). People from the pro-ivermectin crowd on Twitter have been complaining that ritonavir has a pretty heavy side-effect profile. I haven't researched it at all, but I did take a quick look at the FDA Drug Facts sheet, and apparently 57% get nausea, which doesn't sound good.
Very cool, will take a look. This basically solves question 1. It seems the original Solomonoff work isn't published anywhere. By the way, the author, William H. Press, is a real polymath! I'm curious whether there is any extension of this work to agents with finite memory... for example, the same situation where you're screening a large number of people, but now you have a memory where you can store N results of prior screenings for reference. I'm going to look into it...
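For the base (memoryless) case, I believe the screening setup here is Dorfman-style pooled testing: test a pool of s samples at once, and retest each sample individually only if the pool comes back positive. A quick sketch of the standard cost calculation (function names are mine; this doesn't touch the finite-memory extension I'm asking about):

```python
def expected_tests_per_person(s, p):
    # Dorfman pooling: one test covers a pool of s people; if the pool is
    # positive, all s are retested individually. A pool is negative with
    # probability (1 - p)^s, where p is the per-person prevalence.
    return 1.0 / s + 1.0 - (1.0 - p) ** s

def best_pool_size(p, max_s=100):
    # Brute-force the pool size that minimizes expected tests per person.
    return min(range(2, max_s + 1),
               key=lambda s: expected_tests_per_person(s, p))
```

At 1% prevalence this gives an optimal pool size of 11 and roughly 0.2 expected tests per person, i.e. about a 5x saving over testing everyone individually.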
Here's another paper on small / non-robust features, though it's rather specific to patch-based vision transformers: Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation^ This work is very specific to patch-based methods. Whether patches are here to stay, and for how long, is unclear to me, but right now they seem to be on the rise.
For what it's worth - I see value in votes being public by default. It can be very useful to see who upvoted or downvoted your comment. Of course, people will then use the upvote feature just to indicate they read a post, but that's OK (we're familiar with that system from Facebook, Twitter, etc). I'm pretty apathetic about all the other proposals here. Reactions seem to me to be unnecessary distractions. [Side note - emojis are very ambiguous, so it's good you put words next to each one to explain what they're supposed to mean.] I would interpret reactions as a poll of people's System 1 snap judgements. That's arguably useful/interesting information in many contexts, but also distracting in others.
I'm curious why this comment has such low karma and -1 Alignment Forum karma. If you think doom is very likely when AI reaches a certain level, then efforts to buy us time before then have the highest expected utility. The best way to buy time, arguably, is to study the different AI approaches that exist today and figure out which ones are most likely to lead to dangerous AI, then create regulations (either through government or at the corporate level) banning the types of AI systems that are proving very hard to align. (For example, we may want to ban expected reward/utility maximizers completely - satisficers should be able to do everything we want. Also, we may decide there's really no need for AI to be able to self-modify, and ban that too.) Of course a ban can't be applied universally, so existentially dangerous types of AI will get developed somewhere somehow, and there are likely existentially dangerous types of AI we won't have thought of that will still get developed, but at least we'll be able to buy some time to do more alignment research that will hopefully help when that existentially dangerous AI is unleashed. (Addendum: what I'm basically saying is that prosaic research can help us slow down take-off speed, which is generally considered a good thing.)
Also... alignment is obviously a continuum, and of course 100% alignment with all human values is impossible. A different thing you could try to prove is whether it's possible to guarantee human control over an AI system as it becomes more intelligent. There's also a concern that a slightly unaligned system may become less and less aligned as its intelligence is scaled up (either by humans re-building/re-training it with more parameters/hardware or via recursive self-improvement). It would be useful if someone could prove whether that is impossible to prevent. I need to think about this more and read Yampolskiy's paper to really understand what would be most useful to prove possible or impossible.