Charlie Steiner's Shortform

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I think you can steelman Ben Goertzel-style worries about near-term amoral applications of AI being bad "formative influences" on AGI, but mostly under a continuous takeoff model of the world. If AGI is a continuous development of earlier systems, then maybe it shares some datasets and learned models with earlier AI projects, and it definitely shares the broader ecosystem of tools, dataset-gathering methodologies, model-evaluating paradigms, and institutional knowledge on the part of the developers. If the ecosystem in which this thing "grows up" is one that has previously been optimized for marketing, or military applications, or what have you, this is going to have ramifications for how the first AGI projects are designed and what they get exposed to. The more continuous you think the development is going to be, the more room there is to intervene by trying to make sure that AI is pro-social even in the short term.

Will the problem of logical counterfactuals just solve itself with good model-building capabilities? Suppose an agent has knowledge of its own source code, and wants to ask the question "What happens if I take action X?" where its source code provably does not actually do X.

A naive agent might notice the contradiction and decide that "What happens if I take action X?" is a bad question, or a question where any answer is true, or a question where we have to condition on cosmic rays hitting transistors at just the right time. But we want a sophisticated agent to be aware of the contradiction and yet go on to say "Ah, but what I meant wasn't a question about the real world, but a question about some simplified model of the world that lumps all of the things that would normally be contradictory about this into one big node - I take action X, and also my source code outputs X, and also maybe even the programmers don't immediately see X as a bug."

Of course, the sophisticated agent doesn't have to bother saying that if it already makes plans using simplified models of the world that lump things together etc. etc. Its planning will thus implicitly deal with logical counterfactuals, and if it does verbal reasoning that taps into these same models, it can hold a conversation about logical counterfactuals. This seems pretty close to how humans do it.
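
To make the contrast concrete, here's a toy Python sketch - the policy, the coarse model, and the payoffs are all invented, so this is just an illustration of the two styles of answering the question, not a claim about how such an agent would actually be built:

```python
# Toy contrast: the naive agent reasons from its actual source code; the
# sophisticated agent plans inside a coarse model where "my action" is
# just a free variable. Everything here is invented for illustration.

def my_source_code(observation):
    """The agent's actual policy: provably never outputs 'X'."""
    return "Y"

def world_model(action):
    """Coarse model: one lumped node for 'I take this action', with no
    source-code facts represented at all."""
    payoffs = {"X": 10, "Y": 5}  # made-up payoffs
    return payoffs[action]

def naive_counterfactual(action):
    # Conditions on "my source code outputs `action`". When the source code
    # provably does something else, the premise is contradictory and the
    # question degenerates - here we just give up.
    if my_source_code("obs") != action:
        return None
    return world_model(action)

def sophisticated_counterfactual(action):
    # Evaluates the action inside the simplified model, where the fact that
    # the real source code never outputs X simply isn't represented.
    return world_model(action)

print(naive_counterfactual("X"))          # None: the question broke down
print(sophisticated_counterfactual("X"))  # 10: a usable counterfactual
```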

Atheoretically building an agent that is good at making approximate models would therefore "accidentally" be a route to solving logical counterfactuals. But maybe we can do theory here too: a theorem about logical counterfactuals is going to be a theorem about processes for building approximate models of the world - which it actually seems plausible to relate back to logical inductors and the notion of out-planning "simple" agents.

It seems like there's room for a theory of logical-inductor-like agents with limited computational resources, and I'm not sure if this has already been figured out. The entire trick seems to be that when you try to build a logical inductor agent, it's got some estimation process for math problems like "what does my model predict will happen?" and it's got some search process to find good actions, and you don't want the search process to be more powerful than the estimator, because then it will find edge cases. In fact, you want them to be linked somehow, so that the search process is never in the position of taking advantage of the estimator's mistakes - if you, a human, are making some plan and notice a blind spot in your predictions, you don't "take advantage" of yourself, you do further estimating as part of the search process.
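
Here's a toy sketch of the shape of that linkage in Python - the "true values," the noise model, and the uncertainty threshold are all made up, so read it as the intuition in code rather than any kind of formalization:

```python
import random

# Toy version of "don't let the search exploit the estimator": the estimator
# returns a value estimate plus a crude uncertainty, and the linked search
# responds to uncertainty by estimating harder instead of optimizing against it.

TRUE_VALUE = {"a": 1.0, "b": 2.0, "c": 0.5}  # invented ground truth

def estimate(action, effort):
    """Noisy estimate whose error shrinks as estimation effort grows."""
    noise = random.gauss(0, 1.0 / effort)
    return TRUE_VALUE[action] + noise, 1.0 / effort  # (estimate, uncertainty)

def naive_search(actions):
    # Takes cheap estimates at face value: the argmax is biased toward
    # whichever estimate happened to err upward (the edge-case problem).
    return max(actions, key=lambda a: estimate(a, effort=1)[0])

def linked_search(actions, threshold=0.2):
    # While an action's estimate is still too uncertain, the search spends
    # more effort estimating rather than exploiting the blind spot.
    best, best_val = None, float("-inf")
    for a in actions:
        effort = 1
        val, unc = estimate(a, effort)
        while unc > threshold:
            effort *= 2
            val, unc = estimate(a, effort)
        if val > best_val:
            best, best_val = a, val
    return best

print(naive_search(["a", "b", "c"]))   # often fooled by a lucky noisy estimate
print(linked_search(["a", "b", "c"]))  # reliably picks "b"
```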

The hard part is formalizing this handwavy argument, and figuring out what other strong conditions need to be met to get nice guarantees like bounded regret.

Back in the "LW Doldrums" c. 2016, I thought that what we needed was more locations - a welcoming (as opposed to heavily curated, a la the old AgentFoundations) LW-style forum devoted solely to AI alignment, and then the old LW for the people who wanted to talk about human rationality.

This philosophy can also be seen in the choice to make the AI Alignment Forum a sister site to LW2.0.

However, what actually happened is that we now have non-LW forums for SSC readers who want to talk about politics, SSC readers who want to talk about human rationality, and people who want to talk about effective altruism. And meanwhile, LW2.0 and the Alignment Forum have sort of merged into one forum that is mostly talking about AI alignment but sometimes also has posts on COVID, EA, people's personal lives, and economics, and more rarely human rationality. Honestly, it's turned out pretty well.