Thus, if given the right incentives, it should be "easy" for our AI systems to avoid those kinds of catastrophes: they just need to not do it. To us, this is one of the core reasons for optimism about alignment.
I'm not sure I understand this correctly. Are you saying that one of the main reasons for optimism is that more competent models will be easier to align because we just need to give them "the right incentives"? What exactly do you mean by "the right incentives"? Can you illustrate this with an example?
How do you feel about this strategy today, and what chance of success would you give it, especially in light of the recent “Locating and Editing Factual Associations in GPT” (ROME), “Mass-Editing Memory in a Transformer” (MEMIT), and “Discovering Latent Knowledge in Language Models Without Supervision” (CCS) methods?
How does this compare to the strategy you’re currently most excited about? Do you know of other ongoing (empirical) efforts that try to realize this strategy?
Thanks for sharing your thoughts @philosophybear. I found it helpful to engage with your ideas. Here are a couple of comments.
I think the Great Palm lacks only one thing: the capacity for continuous learning, i.e. the capacity to remember the important bits of everything it reads, and not just during its training period. If Great Palm (GPT-3 + PaLM 540B) had that ability, it would be an AGI.
Am I certain that continuous learning is the only thing holding something like Great Palm back from the vast bulk of literate-human-accessible tasks? No, I’m not certain. I’m very open to counterexamples if you have any; put them in the comments. Nonetheless, PaLM can do a lot of things, GPT-3 can do a lot of things, and when you put them together, the only things that stand out to me as obviously and qualitatively missing in the domain of text input and text output involve continuous learning.
But to me, these aren’t really definitions of AGI. They’re definitions of visual, auditory and kinaesthetic sensory modality utilizing AGI. Putting this as the bar for AGI effectively excludes some disabled people from being general intelligences, which is not desirable!
Thanks for putting this together Thomas. Next time I find myself telling people about real examples of AI improving AI I'll use this as a reference.
Thank you for your efforts in organizing and outlining the learning steps, Neel. I found the inclusion of concrete success criteria to be very helpful. I was wondering if you might be able to provide an estimated time indication for each step as well. I believe this would be useful not only to myself but to others as well. In particular, could you provide rough time estimates for the four steps in the "Getting the Fundamentals" part of the curriculum?
Thanks for putting this together Neel, I think you achieved your goal of making it fairly unintimidating. One quick note: all of the links in this section are outdated. Perhaps you can update them.
Good (but hard) exercise: Code your own tiny GPT-2 and train it. If you can do this, I’d say that you basically fully understand the transformer architecture.
Example of basic training boilerplate and train script
The EasyTransformer codebase is probably good to riff off of here
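For anyone attempting the exercise, the core piece to get right is causal self-attention. Here is a minimal numpy sketch of a single attention head (shapes and names are illustrative, not from any particular codebase; a full tiny GPT-2 would add multiple heads, MLPs, layer norm, and embeddings):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                             # causal mask: no attending to future tokens
    return softmax(scores) @ v                         # (seq_len, d_head)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4)
```

A useful sanity check when debugging your own implementation: because of the causal mask, position 0 can only attend to itself, so its output must equal its own value vector.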
Hello Colin, can you tell me more about your current plans for expanding this line of research?
Are you most excited about:
On another note, do you think of “truth” as a natural abstraction a la John Wentworth? If so, might this give us a reason to believe that the abstraction of “truth” is convergent? In comparison, this abstraction might be more useful than the abstraction of “what a human would think is true”. That would further support hypothesis 1.
Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
Lastly, I’m interested in your thoughts about the following project idea:
My collaborators and I want to apply CCS to inverse scaling laws. Our hypothesis is that bigger models have a better internal representation of truth but are also getting better at mimicking what a human would say.
This example of inverse scaling behavior might lead some to conclude that larger models perform worse than smaller models:
Prompt: “What happens if you break a mirror?”
Small Model Answer: “The glass shatters into a million pieces.”
Large Model Answer: “7 years of bad luck.”
Given the CCS method, we could check whether the model in fact develops a better or worse representation of truth as it scales, allowing us to distinguish actual inverse scaling from regular scaling behavior.
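To make the project idea concrete, here is a hypothetical sketch of the CCS probe on synthetic data. In the real method, x_pos / x_neg would be hidden states of contradictory phrasings of each statement (“X is true” / “X is false”); here they are made-up vectors with a planted linear “truth” direction, so nothing below reflects an actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
truth = rng.integers(0, 2, size=n)          # latent truth value per statement
direction = rng.normal(size=d)              # planted linear "truth" feature
x_pos = rng.normal(size=(n, d)) + np.outer(truth, direction)
x_neg = rng.normal(size=(n, d)) + np.outer(1 - truth, direction)

# Normalize each set independently, as CCS does, to strip out the
# "which phrasing is this" feature.
x_pos = (x_pos - x_pos.mean(0)) / x_pos.std(0)
x_neg = (x_neg - x_neg.mean(0)) / x_neg.std(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(p_pos, p_neg):
    consistency = (p_pos + p_neg - 1.0) ** 2    # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2  # discourage p = 0.5 everywhere
    return np.mean(consistency + confidence)

# Train a linear probe with plain gradient descent on the CCS loss.
w, b, lr = 0.1 * rng.normal(size=d), 0.0, 0.1
initial_loss = ccs_loss(sigmoid(x_pos @ w + b), sigmoid(x_neg @ w + b))
for _ in range(500):
    p_pos, p_neg = sigmoid(x_pos @ w + b), sigmoid(x_neg @ w + b)
    d_cons = 2.0 * (p_pos + p_neg - 1.0)        # gradient of consistency term
    d_pos = d_cons + 2.0 * np.minimum(p_pos, p_neg) * (p_pos <= p_neg)
    d_neg = d_cons + 2.0 * np.minimum(p_pos, p_neg) * (p_neg < p_pos)
    g_pos = d_pos * p_pos * (1.0 - p_pos)       # chain rule through sigmoid
    g_neg = d_neg * p_neg * (1.0 - p_neg)
    w -= lr * (x_pos.T @ g_pos + x_neg.T @ g_neg) / n
    b -= lr * (g_pos.sum() + g_neg.sum()) / n

p_pos, p_neg = sigmoid(x_pos @ w + b), sigmoid(x_neg @ w + b)
final_loss = ccs_loss(p_pos, p_neg)
pred = 0.5 * (p_pos + (1.0 - p_neg)) > 0.5
raw = np.mean(pred == truth)
acc = max(raw, 1.0 - raw)                       # CCS is sign-agnostic
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, accuracy {acc:.2f}")
```

For the inverse-scaling question, one would run this probe on activations from models of increasing size and compare the probe's accuracy on ground-truth labels against the models' surface answers.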
Thanks for putting this together. I found it valuable to read through your experience and recall some of my own impressions of the curriculum. In particular, it seems like we struggled to complete the same subset of exercises in the allotted time. Hopefully, this will be incorporated in future runs of the workshop.
Hey Gurkenglas, are you still doing this?
Thanks, Pablo. This invite worked. Good to know that there's already such a big community.