PhD student at the University of Toronto, studying machine learning and working on AI safety problems.
Google's Meena (2.6 billion parameters, February 2020) creates original puns in "Conversation D", and I think "Cross-turn Repetition Example 2" is absurdist comedy, but maybe more as a result of the way the model fails.
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.
Are there any sources that describe why 14 months is the trial period (or provide justification for picking trial periods of various lengths)?
Seems like it ought to be more of a continuous variable, rather than this discrete 14 month trial: at time t, we've observed x people for y months to see if they have wierd long-term side effects, so we should be willing to vaccinate z more people.
The chrome extention Netflix Party lets you synchronize playing the same video on netflix other people, which you can use along with Skype to watch something together.
(You can always fall back to counting down "3,2,1" to start playing the video at the same time, but the experience is nicer if you ever need to pause and resume)
The worry I'd have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaning 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?
Might be interesting to look at information that was available at the start of H1N1 and how accurate it turned out to be in retrospect (though there's no guarantee that we'd make errors in the same direction this time around).
Virus mutates to a less severe form, quarantine measures select for the less severe form, fighting off less severe form provides immunity against more severe form, severe form dies out.
According to https://en.wikipedia.org/wiki/Spanish_flu
Another theory holds that the 1918 virus mutated extremely rapidly to a less lethal strain. This is a common occurrence with influenza viruses: There is a tendency for pathogenic viruses to become less lethal with time, as the hosts of more dangerous strains tend to die out (see also "Deadly Second Wave", above).
Article today suggested that COV19 has already split into two strains and hypothesized that selection pressure from quarantine changed the relative frequencies of the strains, don't think there's evidence about whether one strain is more severe https://academic.oup.com/nsr/advance-article/doi/10.1093/nsr/nwaa036/5775463?searchresult=1
I'm not an expert and this isn't great evidence, so it's maybe in the "improbable" category
I'm talking about an imitation version where the human you're imitating is allowed to do anything they want, including instatiting a search over all possible outputs X and taking that one that maximizes the score of "How good is answer X to Y?" to try to find X*. So I'm more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do along with maintaining capabilities of the system, and we need to figure out what "safe limited search" actually looks like.
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (gets a high reward relative to other answers without actually being a good answer), which causes the update step to increase the probability of M1 giving that output, and eventually causes M1 to give that output with high probability?
I agree, and think that this problem occurs both in imitation IA and RL IA
For example is the plan to make sure M2 has no such robustness problems (if so how)?
I believe the answer is yes, and I think this is something that would need to be worked out/demonstrated. I think there is one hope that if M2 can increase the amount computing/evaluation power it uses for each new sample X as we take more samples, then you can keep taking more samples without ever accepting an adversarial one (This assumes something like for any adversarial example, all M2 with at least some finite amount of computing power will reject it). There's maybe another hope that you could make M2 robust if you're allowed to reject many plausibly good X in order to avoid false positives. I think both of these hopes are in the IOU status, and maybe Paul has a different way to put this picture that makes more sense.
Overall, I think imitative amplification seems safer, but I maybe don't think the distinction is as clear cut as my impression of this post gives.
if you can instruct them not to do things like instantiate arbitrary Turing machines
I think this and "instruct them not to search over arbitrary text strings for the text string that gives the most approval", and similar things, are the kind of details that would need to be filled out to make the thing you are talking about actually be in a distinct class from approval-based amplification and debate (My post on imitation and RL amplification was intended to argue that without further restrictions, imitation amplification is in the same class as approval-based amplification, which I think we'd agree on). I also think that specifying these restrictions in a way that still lets you build a highly capable system could require significant additional alignment work (as in the Overseer's Manual scenario here)
Conversely, I also think there are ways that you can limit approval-based amplification or debate - you can have automated checks, for example, that discard possible answers that are outside of a certain defined safe class (e.g. debate where each move can only be from either a fixed library of strings that humans produced in advance or single direct quotes from a human-produced text). I'd also hope that you could do something like have a skeptical human judge that quickly discards anything they don't understand + an ML imitation of the human judge that discards anything outside of the training distribution (don't have a detailed model of this, so maybe it would fail in some obvious way)
I think I do believe that for problems where there is a imitative amplification decomposition that solves the problem without doing search, that's more likely to be safe by default than approval-based amplification or debate. So I'd want to use imitative amplification as much as possible, falling back to approval only if needed. On imitative amplification, I'm more worried that there are many problems it can't solve without doing approval-maximizing search, which brings the old problems back in again. (e.g. I'm not sure how to use imitative amplification at the meta-level to produce better decomposition strategies than humans use without using approval-based search)