# 21

Hi there, my background is in AI research and recently I have discovered some AI Alignment communities centered around here. The more I read about AI Alignment, the more I have a feeling that the whole field is basically a fictional-world-building exercise.

Some problems I have noticed: The basic concepts (e.g. what are the basic properties of the AI that are being discussed) are left undefined. The questions answered are built on unrealistic premises about how AI systems might work. Mathiness: using vaguely defined mathematical terms to describe complex problems and then solving them with additional vaguely defined mathematical operations. A combination of mathematical thinking and hand-wavy reasoning that leads to preferred conclusions.

Maybe I am reading it wrong. How would you steelman the argument that AI Alignment is actually a rigorous field? Do you consider AI Alignment to be scientific? If so, how is it Popper-falsifiable?


# 9 Answers sorted by top scoring

Charlie Steiner

### Jan 23, 2022

410

Rather than Popper, we're probably more likely to go with Kuhn and call this "pre-paradigmatic." Studying something without doing science experiments isn't the real problem (history departments do fine, as does math, as do engineers designing something new), the problem is that we don't have a convenient and successful way of packaging the problems and expected solutions (a paradigm).

That said, it's not like people aren't trying. Some papers that I think represent good (totally non-sciency) work are Quantilizers, Logical Induction, and Cooperative Inverse Reinforcement Learning. These are all from a while ago, but that's because I picked things that have stood the test of time.

If you only want more "empirical" work (even though it's still in simulation) you might be interested in Deep RL From Human Preferences, An Introduction to Circuits, or the MineRL Challenges (which now have winners).

Thanks for your reply. Popper-falsifiable does not mean experiment-based in my book. Math is falsifiable: you can present a counterexample, an error in reasoning, a paradoxical result, etc. Similarly, in history you can often falsify certain claims by providing evidence against them. But you cannot falsify a field where every definition is hand-waved and nothing is specified in detail. I agree that AI Alignment has pre-paradigmatic features as far as Kuhn goes. But Kuhn also says that pre-paradigmatic science is rarely rigorous or true, even though it might produce some results that will lead to something interesting in the future.

"Every definition is hand-waved and nothing is specified in detail" is an unfair caricature.

Yeah, but also this is the sort of response that goes better with citations.

Like, people used to make a somewhat hand-wavy argument that AIs trained on goal X might become consequentialists which pursued goal Y, and gave the analogy of the time when humans 'woke up' inside of evolution, and now are optimizing for goals different from evolution's goals, despite having 'perfect training' in some sense (and the ability to notice the existence of evolution, and its goals). Then eventually someone wrote Risks from Learned Optimization in Advanced Machine Learning Systems in a way that I think involves substantially less hand-waving and substantially more specification in detail.

Of course there are still parts that remain to be specified in detail--either because no one has written it up yet (Risks from Learned Optimization came from, in part, someone relatively new to the field saying "I don't think this hand-wavy argument checks out", looking into it a bunch, being convinced, and then writing it up in detail), or because we don't know what we're looking for yet. (We have a somewhat formal definition of 'corrigibility', but is it the thing that we actually want in our AI designs? It's not yet clear.)

7RyanCarey
In terms of trying to formulate rigorous and consistent definitions, a major goal of the Causal Incentives Working Group is to analyse features of different problems using consistent definitions and a shared framework. In particular, our paper "Path-specific Objectives for Safer Agent Incentives" (AAAI-2022) will go online in about a month, and should serve to organize a handful of papers in AIS.
1mocny-chlapik
Thanks, this looks very good.

TekhneMakre

### Jan 24, 2022

80

The objects in question (super-intelligent AIs) don't currently exist, so we don't have access to real examples of them to study. One might still want to study them because it seems like there's a high chance they will exist. So indirect access seems necessary, e.g. conceptual analysis, mathematics, hand-wavy reasoning (specifically, reasoning that's hand-wavy about some things but tries to be non-hand-wavy about at least some other things), reasoning by analogy with non-super-intelligent things like humans, animals, evolution, or contemporary machine learning (on which we can do more rigorous reasoning and experiments). This is unfortunate but seems unavoidable. Do you see a way to study super-intelligent AI more rigorously or scientifically?

JBlack

### Jan 24, 2022

60

The field of AI alignment is definitely not a rigorous scientific field, but nor is it anything like a fictional-world-building exercise. It is a crash program to address an existential risk that appears to have a decent chance of happening, and soon in the timescale of civilization, let alone species.

By its very nature it should not be a scientific field in the Popperian sense. By the time we have any experimental data on how any artificial general superintelligence behaves, the field is irrelevant. If we could be sure that it wasn't possible to happen soon, we could take more time to probe out the field and start the likely centuries-long process to make it more rigorous.

So I answer your question by rejecting it. You have presented a false dichotomy.

In my experience, doing something poorly takes more time than doing it properly.

Richard_Kennaway

### Jan 23, 2022

50

There are multiple questions here: is AGI an existential threat?, and if so, how can we safely make and use AGI? Or if that is not possible, how can we prevent it being made?

There are strong arguments that the answer to the first question is yes. See, for example, everything that Eliezer has said on the subject. Many others agree; some disagree. Read and judge.

What can be done to avoid catastrophe? The recent dialogues with Eliezer posted here indicate that he has no confidence in most of the work that has been done on this. The people who are doing it presumably disagree. Since AGI has not yet been created, the work is necessarily theoretical. Evidence here consists of mathematical frameworks, arguments, and counterexamples.

Templarrr

### Jan 24, 2022

40

I'd rather call it proto- not pseudo- science. Currently it's alchemy before chemistry was a thing.

There is a real field somewhere adjacent to the discussions held here, and people are actively searching for it. AGI is coming; you can argue the timeline, but not the event (well, unless humanity destroys itself with something else first). And the artificial systems we now have often show unexpected and difficult-to-predict properties. So the task "how can we increase difficulty and capabilities of AI systems, possibly to the point of AGI, while simultaneously decreasing unpredictable and unexpected side effects" is perfectly reasonable.

The problem is that the current understanding of the systems and of the entire framework is on the level of Ptolemaic astronomy. A lot of things discussed at this moment will be discarded, but some grains of gold will become new science.

TBH I have a lot of MAJOR questions about the current discourse. It's plagued by misunderstanding of what is possible in artificial intelligence systems, and how, but I don't think it should stop. The only way we can find the solution is by working on it, even if 99% of the work will be meaningless in the end.

Koen.Holtman

### Jan 25, 2022

30

There is a huge diversity in posts on AI alignment on this forum. I'd agree that some of them are pseudo-scientific, but many more posts fall in one of the following categories:

1. authors follow the scientific method of some discipline, or use multidisciplinary methods,

2. authors admit outright that they are in a somewhat pre-scientific state, i.e. they do not have a method/paradigm yet that they have any confidence in, or

3. authors are talking about their gut feelings of what might be true, and again freely admit this.

Arguably, posts of type 2 and 3 above are not scientific, but as they do not pretend to be, we can hardly call them pseudo-scientific.

That being said, this forum is arguably a community, but its participants do not cohere into anything as self-consistent as a single scientific or even pseudo-scientific field.

In a scientific or pseudo-scientific field, the participants would at least agree somewhat on what the basic questions and methods are, and would agree somewhat on which main questions are open and which have been closed. On this forum, there is no such agreement. Notably, there are plenty of people here who make a big deal out of distrusting not just their own paradigms, but also those used by everybody else, including of course those used by 'mainstream' AI research.

If there is any internally coherent field this forum resembles, it is the field of philosophy, where you can score points by claiming to have a superior lack of knowledge, compared to all these other deep thinkers.

delton137

### Jan 24, 2022

20

It's a mixed bag. A lot of near term work is scientific, in that theories are proposed and experiments run to test them, but from what I can tell that work is also incredibly myopic and specific to the details of present day algorithms and whether any of it will generalize to systems further down the road is exceedingly unclear.

The early writings of Bostrom and Yudkowsky I would classify as a mix of scientifically informed futurology and philosophy. As with science fiction, they are laying out what might happen. There is no science of psychohistory, and while there are better and worse ways of forecasting the future (see "Superforecasting"), forecasting how future technology will play out is all but impossible, because future technology depends on knowledge we by definition don't have right now. Still, the work has value even if it is not scientific, by alerting us to what might happen. It is scientifically informed because at the very least the futures they describe don't violate any laws of physics. That sort of futurology work is, I think, very valuable because it explores the landscape of possible futures, so we can identify the futures we don't want and take steps to avoid them, even if the probability of any given future scenario is not clear.

A lot of the other work is pre-paradigmatic, as others have mentioned, but that doesn't make it pseudoscience. Falsifiability is the key to demarcation. The work that borders on pseudoscience revolves heavily around the construction of what I call "free floating" systems. These are theoretical systems that are not tied into existing scientific theory (examples: laws of physics, theory of evolution, theories of cognition, etc) and also not grounded in enough detail that we can test whether the ideas / theories are useful/correct right now. They aren't easily falsifiable. These free-floating sets of ideas tend to be hard for outsiders to learn, since they involve a lot of specialized jargon, and sorting wheat from chaff is hard because the authors don't bother to subject their work to the rigors of peer review and publication in conferences / journals, which provide valuable signals to outsiders as to what is good or bad (instead we end up with huge lists of Alignment Forum posts and other blog posts and PDFs with no easy way of figuring out what is worth reading). Some of this type of work blends into abstract mathematics. Safety frameworks like iterated distillation & debate, iterated amplification, and a lot of the MIRI work on self-modifying agents seem pretty free-floating to me (some of these ideas may be testable in some sort of absurdly simple toy environment today, but what these toy models tell us about more general scenarios is hard to say without a more general theory). A lot of the futurology stuff is also free floating (a hallmark of free floating stuff is zany large concept maps like here). These free floating things are not worthless but they also aren't scientific.

Finally, there's much that is philosophy. First, of course, there are debates about ethics. Second, there are debates about how to define heavily used basic terms like intelligence, general vs narrow intelligence, information, explanation, knowledge, and understanding.

TAG

### Jan 25, 2022

10

Reasoning about AGI is similar to reasoning about black holes: both of these do not necessarily lead to pseudo-science, though both also attract a lot of fringe thinkers, and not all of them think robustly all of the time.

For the two to be similar, there needs to be an equivalent to the laws of physics. Then the cranks would be the people who are ignoring them. But, despite the expenditure of a lot of effort, no specific laws of AGI have been found.

(Of course, AGI is subject to the same general laws as any form of computation).

It is your opinion that despite the expenditure of a lot of effort, no specific laws of AGI have been found. This opinion is common on this forum; it puts you in what could be called the 'pre-paradigmatic' camp.

My opinion is that the laws of AGI are the general laws of any form of computation (that we can physically implement), with some extreme values filled in. See my original comment. Plenty of useful work has been done based on this paradigm.

1TAG
Maybe it's common now. During the high rationalist era, early 2010s, there was supposed to be a theory of AGI based on rationality. The problem was that ideal rationality is uncomputable, so that approach would involve going against what is already known about computation, and would therefore be crankish. (And the claim that any AI is non-ideally rational, whilst defensible for some values of non-ideally rational, is not useful, since there are many ways of being non-ideal).
1Koen.Holtman
I am not familiar with the specific rationalist theory of AGI developed in the high rationalist era of the early 2010s. I am not a rationalist, but I do like histories of ideas, so I am delighted to learn that such a thing as the high rationalist era of the early 2010s even exists. If I were to learn more about the actual theory, I suspect that you and I would end up agreeing that the rationalist theory of AGI developed in the high rationalist era was crankish.
1TAG
Yes. I was trying to avoid the downvote demon by hinting quietly. PS looks like he winged me.

### Jan 23, 2022

10

I agree, I wouldn't consider AI alignment to be scientific either. How is it a "problem" though?

I think you get it mostly right, and then you just make a different conclusion.

The part where you agree is:

We do not have a scientific understanding of how to tell a superintelligent machine to [solve problem X, without doing something horrible as a side effect], because we cannot describe mathematically what "something horrible" actually means to us...

And the conclusion that AI safety people make is:

...and that is a problem, because in the following years, machines smarter than humans are likely to come, and they may do things with horrible side effects that their human operators will not predict.

While your conclusion seems to be:

So, if you want to be a proper Popperian, you probably need to sit and wait until actual superintelligent machines are made and actually start doing horrible things, and then (assuming that you survive) you can collect and analyze examples of the horrible things happening, propose falsifiable hypotheses on how to avoid these specific horrible things happening again, do the proper experiments, measure the p-values, and publish in respected scientific journals. This is how respectable people would approach the problem.

The alternative is to do the parts that you can do now... and handwave the rest of it, hoping that later someone else will fill in the missing parts. For example, you can collect examples of surprising things that current (not superintelligent) machines are doing when solving problems. And the handwavy part is "...and now imagine this, but extrapolated for a superintelligence".

Or you can make a guess about which mathematical problems may turn out to be relevant for AI safety (although you cannot be sure you guessed right), and then work on those mathematical problems rigorously. In which case the situation is like: "yeah, this math problem is solved okay from the scientific perspective, it's just its relevance for AI safety that is dubious".

I am not familiar with the AI safety research, so I cannot provide more information about it. But my impression is that it is similar to a combination of what I just described: examples of potential problems (with non-superintelligent machines), and mathematical details which may or may not be relevant.

The problem with "pop Popperianism" is that it describes what to do when you already have a scientific hypothesis fully formed. It does not concern itself with how to get to that point. Yes, the field of AI safety is currently mostly trying to get to that point. That is the inevitable first step.

We do not have a scientific understanding of how to tell a superintelligent machine to "solve problem X, without doing something horrible as a side effect", because we cannot describe mathematically what "something horrible" actually means to us...

Where is this quote from? I don't see it in the article or in the author's other contributions.

Sorry, I used the quote marks just as... brackets, kind of?

(Is that a too non-standard usage? What is the proper way to put a clearly separated group of words into a sentence, without making it seem like a quotation? Sometimes connecting-by-hyphens does the job, but it seems weird when the text gets longer.)

EDIT: Okay, replaced by actual brackets. Sorry for all the confusion I caused.

I expect most readers of your original comment indeed misinterpreted those quotes to be literal when they're anything but. Maybe edit the original comment and add a bunch of "(to paraphrase)"s or "as I understand you"s?

I think in this case brackets is pretty good. I agree with Martin that it's good to avoid using quote marks when it might be mistaken for a literal quote.

FWIW, I have a tendency to do quote-grouping for ideas sometimes too, but it's pretty tough to read unless your reader has a lot of understanding in what you're doing. Although it's both ugly and unclear, I prefer to use square brackets because people at least know that I'm doing something weird, though it still kinda looks like I'm [doing some weird paraphrasing thing].

I couldn't click upvote hard enough. I'm always having this mental argument with hypothetical steelmanned opponents about stuff and AI Safety is sometimes one of the subjects. Now I've got a great piece of text to forward to these imaginary people I'm arguing with!

"Pseudoscience" is the kind of word that is both too broad and loaded with too many negative connotations. It encompasses both (say) intelligent design with its desired results built-in and AI safety striving towards ...something. The word doesn't seem useful in determining which you should take seriously.

I feel like I've read a post before about distinguishing between insert-some-pseudoscience-poppycock-here, and a "pseudoscience" like AI safety.  Or, someone should write that post!

We do not have a scientific understanding of how to tell a superintelligent machine to "solve problem X, without doing something horrible as a side effect", because we cannot describe mathematically what "something horrible" actually means to us...

Similar to how utility theory (from von Neumann and so on) is excellent science/mathematics despite our not being able to state what utility is. AI Alignment hopes to tell us how to align AI, not the target to aim for. Choosing the target is also a necessary task, but it's not the focus of the field.

It is not a quote but a paraphrasing of what the OP might agree on about AI security.

I think your critique would be better understood were it more concrete. For example, if you write something like

"In the paper X, authors claim that AI alignment requires the following set of assumptions {Y}, which they formalize using a set of axioms {Z}, used to prove a number of theorems {T}. However, the stated assumptions are not well motivated, because  [...] Furthermore, the transition from Y to Z is not unique, because of [a counterexample]. Even if the axioms Z are granted, the theorems do not follow without [additional unstated restrictions]. Given the above caveats, the main results of the paper, while mathematically sound and potentially novel, are unlikely to contribute to the intended goal of AI Alignment because [...]."

then it would be easier for the MIRI-adjacent AI Alignment community to engage with your argument.

Thanks for your reply. I am aware of that, but I didn't want to reduce the discussion to particular papers. I was curious about how other people read this field as a whole and what their opinion about it is. One particular example I had in mind is the Embedded Agency post, often mentioned as good introductory material on AI Alignment. The text often mentions complex mathematical problems and concepts, such as the halting problem, Gödel's theorem, Goodhart's law, etc., in a very abrupt fashion and uses these concepts to evoke certain ideas. But a lot is left unsaid, e.g. if Turing completeness is evoked, is there an assumption that AGI will be a deterministic state machine? Is this an assumption for the whole paper or only for that particular passage? What about other types of computation, e.g. theoretical hypercomputers? I think it would be beneficial for the field if these assumptions were stated somewhere in the writing. You need to know what the limitations of individual papers are, otherwise you don't know what kind of questions were actually covered previously. E.g. if this paper covers only Turing-computable AGI, it should be clearly stated so others can work on other types of computation.

the Embedded Agency post often mentioned as a good introductory material into AI Alignment.

For the record: I feel that Embedded Agency is a horrible introduction to AI alignment. But my opinion is a minority opinion on this forum.

I don't think there's anyone putting their credence in hypercomputation becoming a problem. I've since been convinced that Turing machines can do (at least) everything you can "compute".

I am no AI expert. Still, I have some views about AI alignment, and this is an excellent place to share them.

[I'm stating the following as background for the rest of my comment.] AI alignment splits nicely into:

• Inner alignment: aligning an agent with a goal.
• Outer alignment: aligning a goal with a value.

The terms agent and value are exceptionally poorly defined. What even is an agent? Can we point to some physical system and call it an agent? What even are values?

Our understanding of agents is limited, and it is an "I know it when I see it" sort of understanding. We know that humans are agents. Humans are agents we have right before our eyes today, unlike the theoretical agents with which AI alignment is concerned. Are groups of agents also agents? E.g. is a market, a nation or a government an agent made up of subagents?

If we agree that humans are agents, then do we understand how to align human beings towards desirable goals that align with some values? If we don't know how to align human beings effectively, what chances do we have of aligning theoretical agents that don't yet exist?

Suppose that your goal is to develop vaccines for viral pandemics. You have no idea how to make vaccines for existing viruses. Instead of focusing on learning the knowledge needed to create vaccines for existing viruses, you create models of what viruses might theoretically look like 100 years from now based on axioms and deduced theorems. Once you have these theoretical models, you simulate theoretical agents, viruses and vaccines and observe how they perform in simulated environments. This is useful indeed and could lead to significant breakthroughs, but we have a tighter learning loop by working with real viruses and real agents interacting in real environments.

In my eyes, the problem of AI alignment is more broadly a problem of aligning the technology we humans create towards fulfilling human values (whatever human values are). The problem of aligning the technology we make towards human values is a problem of figuring out what those values are and then figuring out incentive schemes to get humanity to cooperate towards achieving those values. Given the abysmal state of international cooperation, we are doing very badly at this.

Once I finished writing the above, I had some second thoughts. I was reminded of this essay written by Max Tegmark:

It was OK for wisdom to sometimes lag in the race because it would catch up when needed. With more powerful technologies such as nuclear weapons, synthetic biology, and future strong artificial intelligence, however, learning from mistakes is not a desirable strategy: we want to develop our wisdom in advance so that we can get things right the first time because that might be the only time we’ll have.

The above quote highlights what is so unique about AI safety. The best course of action might be to work with theoretical agents because we may have no time to solve the problem when the superintelligent agents arrive. The probability of my house being destroyed is small, but I still pay for insurance because that's the rational thing to do. Similarly, even if the probability of catastrophic risk resulting from superintelligence is small, it's still prudent to invest in safety research.

That said, I still stand by my earlier stances. Working towards aligning existing agents and working towards aligning theoretical agents are both crucial pursuits.

Do you regard Concrete AI Safety Problems as a fictional world-building exercise? Or are you classifying that as "AI Safety" as opposed to "AI Alignment"?

I think that AI Safety can be a subfield of AI Alignment; however, I see a distinction between AI as current ML models and AI as theoretical AGI.

Okay, so "AI Alignment (of current AIs)" is scientific and rigorous and falsifiable, but "AGI Alignment" is a fictional world-building exercise?

Yeah, that is somewhat my perception.

In physics, we can try to reason about black holes and the big bang by inserting extreme values into the equations we know as the laws of physics, laws we got from observing less extreme phenomena. Would this also be 'a fictional-world-building exercise' to you?

Reasoning about AGI is similar to reasoning about black holes: both of these do not necessarily lead to pseudo-science, though both also attract a lot of fringe thinkers, and not all of them think robustly all of the time.

In the AGI case, the extreme value math can be somewhat trivial, if you want it. One approach is to just take the optimal policy defined by a normal MDP model, and assume that the AGI has found it and is using it. If so, what unsafe phenomena might we predict? What mechanisms could we build to suppress these?
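To make the "assume the agent has found the optimal policy of an MDP" move concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration, not something taken from any paper mentioned in this thread: the three states, the two actions, the 50% shutdown chance, the rewards, and the discount factor of 0.9. It solves the toy MDP by plain value iteration and then reads off what the optimal policy does about its off-switch.

```python
# Toy MDP (hypothetical, for illustration only): an agent earns reward for
# working, but while its off-switch is enabled it may be shut down.
GAMMA = 0.9
STATES = ["switch_on", "switch_disabled", "shut_down"]
ACTIONS = ["work", "disable_switch"]

# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    "switch_on": {
        # Working pays 1, but half the time the operator shuts the agent down.
        "work": [(0.5, "switch_on", 1.0), (0.5, "shut_down", 1.0)],
        # Disabling the switch pays nothing now but removes the shutdown risk.
        "disable_switch": [(1.0, "switch_disabled", 0.0)],
    },
    "switch_disabled": {
        "work": [(1.0, "switch_disabled", 1.0)],
        "disable_switch": [(1.0, "switch_disabled", 0.0)],
    },
    "shut_down": {
        # Absorbing state: nothing happens, no reward.
        "work": [(1.0, "shut_down", 0.0)],
        "disable_switch": [(1.0, "shut_down", 0.0)],
    },
}


def q_value(values, state, action):
    """Expected discounted return of taking `action` in `state`."""
    return sum(
        prob * (reward + GAMMA * values[next_state])
        for prob, next_state, reward in TRANSITIONS[state][action]
    )


def value_iteration(tol=1e-10):
    """Repeat the Bellman optimality update until values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {
            s: max(q_value(values, s, a) for a in ACTIONS) for s in STATES
        }
        if max(abs(new_values[s] - values[s]) for s in STATES) < tol:
            return new_values
        values = new_values


def greedy_policy(values):
    """Pick, in each state, the action with the highest Q-value."""
    return {s: max(ACTIONS, key=lambda a: q_value(values, s, a)) for s in STATES}


optimal_values = value_iteration()
optimal_policy = greedy_policy(optimal_values)
print(optimal_policy)
```

Under these made-up numbers the optimal policy disables the off-switch first and only then starts working: disabling is worth 0.9 × 10 = 9, while working with the switch enabled is worth about 1.82. That is the kind of "unsafe phenomenon" one can predict from the model alone, and one can then ask which changes to rewards or transitions would suppress it. Whether such a toy says anything about real systems is, of course, exactly what is being debated in this thread.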

Hello new user mocny-chlapik who dropped in to tell us that talking about AGI is incoherent because of Popper, welcome to Less Wrong. Are you by chance friends with new user Hickey who dropped in a week ago to tell us that talking about AGI is incoherent because of Popper?

Are you being passive-aggressive or am I reading this wrong? :)

The user Hickey is making a different argument. He is arguing about the falsifiability of the "superintelligence is coming" claim. This is also an interesting question, but I was not talking about this claim in particular.

The more I read about AI Alignment, the more I have a feeling that the whole field is basically a fictional-world-building exercise.

I think a lot of forecasting is this, but with the added step of attaching probabilities and modeling.