I particularly appreciate the questions that ask one to look at how a problem was reified/specified/ontologized in a particular domain and then ask for alternative specifications. I thought Superintelligence (2014) might be net harmful because it introduced a lot of such specifications that I then noticed were hard to think around. I think there's a subset of prompts from the online course/book Framestorming that might be useful there; I'll go see if I can find them.
I also have this impression regarding Superintelligence. I'm wondering if you have examples of a particular concept or part of the framing that you think was net harmful?
The speed/collective/quality superintelligence distinction is the one that springs most readily to mind, but quite a few of the book's distinctions struck me this way at the time I read it.
I also thought the treacherous turn and the chapter on multipolar cooperation baked in a lot of specifics.
I really worry about this and it has become quite a block. I want to support fragile baby ontologies emerging in me amidst a cacophony of "objective"/"reward"/etc. taken for granted.
Unfortunately, going off and trying to deconfuse the concepts on my own is slow and feedback-impoverished and makes it harder to keep up with current developments.
I think repurposing "roleplay" could work somewhat, with clearly marked entry into and exit from a framing. But ontological assumptions get absorbed so illegibly that deliberately unseeing them is extremely hard, at least without being constantly on guard.
Are there other ways that you recommend (from Framestorming or otherwise)?
I think John Cleese's relatively recent book on creativity and Olivia Fox Cabane's The Net and the Butterfly are both excellent.
Great list of interesting questions, trains of thought and project ideas in this post.
I was a little surprised not to find any exercises on interpretability. Perhaps there was a reason for excluding it, but if not, here's an idea for another exercise/group of exercises to include (it could perhaps be merged into the "Neural networks" section):
Interpretability
Yeah, I'm keen to add exercises on interpretability. I like the direction of yours, but it feels a bit too hard, in the sense that it's a pretty broad request where it's difficult to know where to start or how much progress they're making. Any ideas on what more specific things we could ask them to do, or ways to make the exercise more legible to them?
That's a fair point. I had thought this would be around the same level of difficulty as some of the exercises in the list such as "Produce a proposal for the ELK prize". But I'm probably biased because I have spent a bit of time working in this area already.
I don't know off the top of my head any ways to decompose the problem or simplify it further, but I'll post back if I think of any. I think it will help as Lucid and Lucent get better, or perhaps if Anthropic open-sources their interpretability tooling. That could make it significantly easier to onboard people to these kinds of problems and scale up the effort.
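For a sense of what that onboarding might look like, here's a minimal feature-visualization sketch using Lucent (the PyTorch port of Lucid); the specific model, layer, and channel are just illustrative choices, not a prescribed exercise:

```python
# Illustrative sketch: optimize an input image to maximally activate one channel.
# Assumes the lucent library and PyTorch are installed.
import torch
from lucent.optvis import render
from lucent.modelzoo import inceptionv1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = inceptionv1(pretrained=True).to(device).eval()

# Visualize channel 476 of layer mixed4a -- the kind of "playing around"
# that Circuits-style work builds on.
images = render.render_vis(model, "mixed4a:476")
```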
Difference IMO is mainly that Circuits steps you through the problem in a way designed to help you understand their thinking, whereas ELK steps you through the problem in a way designed to get people to contribute.
(Perhaps "produce a proposal for something to investigate" might be of a similar difficulty as the ELK prize, but also Circuits work is much more bottom-up so it seems hard to know what to latch onto before having played around a bunch. Agreed that new tooling for playing around with things would help a lot.)
This post, by example, seems like a really good argument that we should spend a little more effort on didactic posts of this sort. E.g. rather than just saying "physical systems have multiple possible interpretations," we could point people to a post about a gridworld with a deterministic system playing the role of the agent, such that there are a couple different pretty-good ways of describing this agent that mostly agree but generalize in different ways.
This perspective might also be a steelmanning of that sort of paper where there's an abstract argument that does all the work, and then some code that tells you nothing new if you followed the abstract argument. The code (in the steelmanned story) isn't just there to make the paper be in the right literary genre or provide a semi-trustworthy signal that you're not a crank who makes bad abstract arguments, it's a didactic tool to help the reader do these sorts of exercises.
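As a hypothetical illustration of the gridworld idea above (everything here is made up for didactic purposes, not drawn from an existing post): a deterministic policy that walks toward a fixed goal can be described equally well as "it moves toward the goal" or "it moves east" on the training grid, and the two descriptions only come apart out of distribution.

```python
# Hypothetical sketch: one deterministic gridworld "agent", two equally good
# descriptions on the training distribution that generalize differently.

def agent_policy(pos, goal):
    """Hard-coded behavior: step toward the goal along x first, then along y."""
    x, y = pos
    gx, gy = goal
    if x != gx:
        return (x + (1 if gx > x else -1), y)
    if y != gy:
        return (x, y + (1 if gy > y else -1))
    return pos

def interp_goal_seeker(pos, goal):
    return agent_policy(pos, goal)   # description 1: "it moves toward the goal"

def interp_east_walker(pos, goal):
    x, y = pos
    return (x + 1, y)                # description 2: "it just moves east"

# On the training grid the goal is always due east, so both descriptions
# match the observed behavior exactly...
train_cases = [((0, 0), (3, 0)), ((1, 0), (3, 0)), ((2, 0), (3, 0))]
assert all(interp_goal_seeker(p, g) == interp_east_walker(p, g) == agent_policy(p, g)
           for p, g in train_cases)

# ...but they generalize in different ways once the goal is somewhere else.
print(interp_goal_seeker((2, 2), (0, 2)))  # (1, 2): turns back toward the goal
print(interp_east_walker((2, 2), (0, 2)))  # (3, 2): keeps walking east
```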
Curated. Exercises are crucial for the mastery of topics and the transfer of knowledge; it's great to see someone coming up with them for the nebulous field of Alignment.
** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.
Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant - it makes a number of assumptions about agents and utility functions and I wasn't able to connect it to why I should expect an agent trained using CIRL to kill me.
FWIW here's my alternative answer:
CIRL agents are bottlenecked on the human overseer's ability to provide them with a learning signal through demonstration or direct communication. This is unlikely to scale to superhuman abilities in the agent, so superintelligent agents simply will not be trained using CIRL.
In other words, it's only a solution to "Learn from Teacher" in Paul's 2019 decomposition of alignment, not to the whole alignment problem.
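To make the bottleneck concrete, here's a toy sketch (not a faithful CIRL implementation; all names and numbers are illustrative): the agent's belief over reward hypotheses only updates from human demonstrations, so how quickly and how far it converges is capped by the quality of the human's signal.

```python
# Toy illustration only: the agent's estimate of the reward is driven entirely
# by human demonstrations, so its competence is capped by the demonstrator.
import random

def human_demonstration(true_theta, skill=0.7):
    """The human acts on the true reward, but only as reliably as their skill allows."""
    return true_theta if random.random() < skill else random.choice([0, 1, 2])

def update_belief(belief, demo):
    """Shift probability mass toward the reward hypothesis consistent with the demo."""
    belief = [b * (2.0 if i == demo else 0.5) for i, b in enumerate(belief)]
    total = sum(belief)
    return [b / total for b in belief]

true_theta = 1              # which of three candidate reward functions is correct
belief = [1 / 3] * 3        # the agent's prior over reward hypotheses
for _ in range(20):
    belief = update_belief(belief, human_demonstration(true_theta))

# The belief concentrates on theta=1 only as fast as the human's noisy, bounded
# demonstrations allow -- there is no learning signal beyond what they provide.
print(belief)
```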
I thought about Agency Q4 (counterargument to Pearl) recently, but couldn't come up with anything convincing. Does anyone have a strong view/argument here?
I don't see any claim that it's impossible for neural nets to handle causality. Pearl's complaining about AI researchers being uninterested in that goal.
I suspect that neural nets are better than any other approach at handling the hard parts of causal modeling: distinguishing plausible causal pathways from ridiculous ones.
Neural nets currently look poor at causal modeling for roughly the same reason that High Modernist approaches weren't willing to touch causal claims: without a world model that's comprehensive enough to approximate common sense, causal modeling won't come close to human-level performance.
A participant in Moderna's vaccine trial was struck by lightning. How much evidence is that for our concern that the vaccine is risky?
If I try to follow the High Modernist approach, I think it says something like: we should either be uncertain enough to avoid any conclusion, or we should treat the lightning strike as evidence of vaccine risks.
As far as I can tell, AI approaches other than neural nets perform like scientists who blindly follow a High Modernist approach (assuming the programmers didn't think to encode common sense about whether vaccines affect behavior in a lightning-strike-seeking way).
GPT-3, by contrast, has some hints about human beliefs that make it likely to guess a little bit better than the High Modernist.
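As a back-of-the-envelope illustration (all the numbers here are made up): the lightning strike only moves us toward "the vaccine is risky" to the extent that the strike is more likely under that hypothesis than under "the vaccine is safe", and common sense says that likelihood ratio is about 1.

```python
# Made-up numbers, just to show the shape of the update.
prior_odds = 0.2 / 0.8        # hypothetical prior odds that the vaccine is risky

# Common sense: vaccination doesn't make people seek out lightning strikes,
# so the likelihood ratio is ~1 and the posterior odds barely move.
print(prior_odds * 1.0)       # 0.25 -- essentially no update

# A scorer with no world model might treat any adverse event in the trial as
# evidence of risk, e.g. assigning it a likelihood ratio of 2.
print(prior_odds * 2.0)       # 0.5 -- a sizeable, spurious update
```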
GPT-3 wasn't designed to be good at causality. It's somewhat close to being a passive observer. If I were designing a neural net to handle causality, I'd give it an ability to influence an environment that resembles what an infant has.
If there are any systems today that are good at handling causality, I'd guess they're robocar systems. What I've read about those suggests they're limited by the difficulty of common sense, not causality.
I expect that when causal modeling becomes an important aspect of what AI needs for further advances, it will be done with systems that use neural nets as important components. They'll probably look a bit more like Drexler's QNR than like GPT-3.
Just a quick logistical thing: do you have any better source of Pearl making that argument? The current Quanta Magazine link isn't totally satisfactory, but I'm having trouble replacing it.
It's currently hard to know where to start when trying to get better at thinking about alignment. So below I've listed a few dozen exercises which I expect to be helpful. They assume a level of background alignment knowledge roughly equivalent to what's covered in the technical alignment track of the AGI safety fundamentals course. They vary greatly in difficulty - some are standard knowledge in ML, some are open research questions. I’ve given the exercises star ratings from * to *** for difficulty (note: not for length of time to complete - many require reading papers before engaging with them). However, I haven't tried to solve them all myself, so the star ratings may be significantly off.
I've erred on the side of including exercises which seem somewhat interesting and alignment-related even when I'm uncertain about their value; when working through them, you should keep the question "is this actually useful? Why or why not?" in mind as a meta-exercise. This post will likely be updated over time to remove less useful exercises and add new ones.
I'd appreciate any contributions of:
Reward learning
Agency
Reinforcement learning
Neural networks
Alignment theory
Agent foundations
Evolution and economics
Some important concepts in ML
These are intended less as exercises and more as pointers to open questions at the cutting edge of deep learning.
Miscellaneous