Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

Some quick info about me:

I'm Ofer G. and I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

Some additional online presence: I use the username "ofer" on the EA Forum.

(Feel free to reach out by sending me a PM through LessWrong.)

ofer's Comments

What are the best arguments and/or plans for doing work in "AI policy"?

Note that research related to governments is just a part of "AI policy" (which also includes stuff like research on models/interventions related to cooperation between top AI labs and publication norms in ML).

Open question: are minimal circuits daemon-free?

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).

Let C be a minimal circuit that takes as input a string of length n that encodes a Turing machine, and outputs a string that is the concatenation of the first k configurations in the simulation of that Turing machine (each configuration is encoded as a string).

Now consider a string x that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input x, the computation of the output of C simulates a consequentialist; and C is a minimal circuit.
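To make the kind of function in this construction concrete, here is a minimal Python sketch of "simulate a Turing machine for a fixed number of steps and concatenate the encoded configurations". It is an ordinary program, not a circuit, and it says nothing about minimality. The transition-table format, the bracket notation for the head position, the `|` separator, and the names `first_configurations` and `encode_configuration` are all assumptions made up for this illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (state, symbol) -> (new_state, symbol_to_write, head_move)
Transitions = Dict[Tuple[str, str], Tuple[str, str, int]]

BLANK = "_"


def encode_configuration(state: str, tape: List[str], head: int) -> str:
    """Encode one configuration as a string; the cell under the head is bracketed."""
    cells = [f"[{s}]" if i == head else s for i, s in enumerate(tape)]
    return f"{state}:" + "".join(cells)


def first_configurations(
    transitions: Transitions, start_state: str, tape_input: str, k: int
) -> str:
    """Return the first k configurations of the simulation, joined by '|'.

    If the machine has no applicable transition (it has halted), the halted
    configuration is simply repeated for the remaining steps.
    """
    tape = list(tape_input) or [BLANK]
    state, head = start_state, 0
    encoded = []
    for _ in range(k):
        encoded.append(encode_configuration(state, tape, head))
        key = (state, tape[head])
        if key not in transitions:
            continue  # halted: nothing changes from here on
        state, write, move = transitions[key]
        tape[head] = write
        head += move
        # Grow the tape with blanks when the head walks off either end.
        if head < 0:
            tape.insert(0, BLANK)
            head = 0
        elif head == len(tape):
            tape.append(BLANK)
    return "|".join(encoded)


# Toy machine that flips bits while moving right, then halts on a blank.
FLIP: Transitions = {("s", "0"): ("s", "1", 1), ("s", "1"): ("s", "0", 1)}
print(first_configurations(FLIP, "s", "01", 3))  # s:[0]1|s:1[1]|s:10[_]
```

The point of the argument carries over directly: nothing in this function cares what the encoded machine does, so if the input happens to encode a machine that simulates a consequentialist, computing the output involves simulating that consequentialist.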

AI Alignment Open Thread October 2019

Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and the first 1,000 days of thought?

I agree. We can frame this empirical uncertainty more generally by asking: What is the smallest T such that there is no meaningful difference between all the things that can happen in a human brain while thinking about a question for T minutes vs. 1,000 days.

Or rather: What is the smallest T such that 'learning to generate answers that humans may give after thinking for T minutes' is not easier than 'learning to generate answers that humans may give after thinking for 1,000 days'.

I should note that, conditioned on the above scenario, I expect that labeled 10-minute-thinking training examples would be at most a tiny fraction of all the training data (when considering all the learning that had a role in building the model, including learning that produced pre-trained weights etcetera). I expect that most of the learning would be either 'supervised with automatic labeling' or unsupervised (e.g. 'predict the next token') and that a huge amount of text (and code) that humans wrote will be used; some of which would be the result of humans thinking for a very long time (e.g. a paper on arXiv that is the result of someone thinking about a problem for a year).

AI Alignment Open Thread October 2019

Thank you! I think I now have a better model of how people think about factored cognition.

the 'humans thinking for 10 minutes chained together' might have very different properties from 'one human thinking for 1000 days'. But given that those have different properties, it means it might be hard to train the 'one human thinking for 1000 days' system relative to the 'thinking for 10 minutes' system, and the fact that one easily extends to the other is evidence that this isn't how thinking works.

In the above scenario I didn't assume that humans, or the model, use factored cognition when the 'thinking duration' is long. Suppose instead that the model is running a simulation of a system that is similar (at some level of abstraction) to a human brain. For example, suppose some part of the model represents a configuration of a human brain, and during inference some iterative process repeatedly advances that configuration by a single "step". Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by roughly 14 billion steps (1,000 days, at the same rate of steps per minute); and the runtime is linear in the number of steps.
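A rough sketch of what I mean by "the same iterative process, just run for longer". The `step` function here is an arbitrary toy update on an integer, standing in for whatever would advance a brain-like configuration; it and the names `advance` and `steps_for_minutes` are made up for illustration. The only number taken from the example above is the rate of 100,000 steps per 10 minutes, at which 1,000 days works out to 14.4 billion steps (the same order of magnitude as the figure quoted above).

```python
MINUTES_PER_DAY = 24 * 60
STEPS_PER_MINUTE = 10_000  # implied by 100,000 steps per 10 minutes


def steps_for_minutes(minutes: int, steps_per_minute: int = STEPS_PER_MINUTE) -> int:
    """Number of single-step updates needed to cover a given thinking duration."""
    return minutes * steps_per_minute


def step(configuration: int) -> int:
    """Toy single-step update (hypothetical stand-in for advancing a configuration)."""
    return (configuration * 3 + 1) % 1_000_003


def advance(configuration: int, steps: int) -> int:
    """Apply the same update `steps` times; runtime is linear in `steps`."""
    for _ in range(steps):
        configuration = step(configuration)
    return configuration


print(steps_for_minutes(10))                      # 100000
print(steps_for_minutes(1000 * MINUTES_PER_DAY))  # 14400000000

# A long run is just short runs chained together: no new mechanism appears.
assert advance(7, 30) == advance(advance(7, 10), 20)
```

Nothing in `advance` distinguishes a short duration from a long one; the duration only shows up as the loop count.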

Generally, one way to make predictions about the final state of complicated physical processes is to simulate them. Solutions that do not involve simulations (or equivalent) may not even exist, or may be less likely to be found by the training algorithm.

AI Alignment Open Thread October 2019

[Question about factored cognition]

Suppose that at some point in the future, for the first time in history, someone trains an ML model that takes any question as input, and outputs an answer that an ~average human might have given after thinking about it for 10 minutes. Suppose that model is trained without any safety-motivated interventions.

Suppose also that the architecture of that model is such that '10 minutes' is just a parameter, T, that the operator can choose per inference, and there's no upper bound on it; and the inference runtime increases linearly with T. So, for example, the model could be used to get an answer that a human would have come up with after thinking for 1000 days.

In this scenario, would it make sense to use the model for factored cognition? Or should we consider running this model with T = 1000 days to be no more dangerous than running it many times with T = 10 minutes?

ofer's Shortform

Nothing in life is as important as you think it is when you are thinking about it.

--Daniel Kahneman, Thinking, Fast and Slow

To the extent that the above phenomenon tends to occur, here's a fun story that attempts to explain it:

At every moment our brain can choose something to think about (like "that exchange I had with Alice last week"). How does the chosen thought get selected from the thousands of potential thoughts? Let's imagine that the brain assigns an "importance score" to each potential thought, and thoughts with a larger score are more likely to be selected. Since there are thousands of thoughts to choose from, the optimizer's curse makes our brain overestimate the importance of the thought that it ends up selecting.
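Here is a small simulation of the story, as a sketch rather than a model of any actual brain. Each potential thought gets a true importance plus independent noise, and the thought with the highest noisy score is selected. Picking the maximum is a simplification of "thoughts with a larger score are more likely to be selected"; the Gaussian distributions, the parameter values, and the function name `selected_thought_bias` are all assumptions made for illustration.

```python
import random
import statistics


def selected_thought_bias(
    num_thoughts: int = 1000,
    noise_sd: float = 1.0,
    trials: int = 200,
    seed: int = 0,
) -> float:
    """Average of (estimated importance - true importance) for the selected thought."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_importance = [rng.gauss(0.0, 1.0) for _ in range(num_thoughts)]
        # The "importance score" is the true importance plus unbiased noise.
        estimated = [t + rng.gauss(0.0, noise_sd) for t in true_importance]
        # Select the thought with the highest estimated importance.
        chosen = max(range(num_thoughts), key=estimated.__getitem__)
        gaps.append(estimated[chosen] - true_importance[chosen])
    return statistics.mean(gaps)


# With many candidate thoughts, the selected one is overestimated on average.
print(selected_thought_bias(num_thoughts=1000))
# With a single candidate there is no selection, so the average gap is near zero.
print(selected_thought_bias(num_thoughts=1, trials=2000))
```

Every individual score is unbiased; the overestimate appears only because we look at the score of whichever thought won the selection.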

What AI safety problems need solving for safe AI research assistants?

If "unintended optimization" refers only to the inner alignment problem, then there's also the malign prior problem.

What AI safety problems need solving for safe AI research assistants?

Well, the reason I mentioned the "utility function over different states of matter" thing is because if your utility function isn't specified over states of matter, but is instead specified over your actions (e.g. behave in a way that's as corrigible as possible), you don't necessarily get instrumental convergence.

I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM cells whose values represent the selected action)? If so, we're talking about systems that 'want to affect (some part of) the world', and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control over as many resources in our world as possible).

My impression is that early thinking about Oracles wasn't really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there's no real reason to believe these early "Oracle" models are an accurate description of current or future (un)supervised learning systems.

It seems possible that something like this has happened. Though as far as I know, we don't currently know how to model contemporary supervised learning at an arbitrarily large scale in complicated domains.

How do you model the behavior of the model on examples outside the training set? If your answer contains the phrase "training distribution" then how do you define the training distribution? What makes the training distribution you have in mind special relative to all the other training distributions that could have produced the particular training set that you trained your model on?

Therefore, I'm sympathetic to the following perspective, from Armstrong and O'Rourke (2018) (the last sentence was also quoted in the grandparent):

we will deliberately assume the worst about the potential power of the Oracle, treating it as being arbitrarily super-intelligent. This assumption is appropriate because, while there is much uncertainty about what kinds of AI will be developed in future, solving safety problems in the most difficult case can give us an assurance of safety in the easy cases too. Thus, we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

What AI safety problems need solving for safe AI research assistants?

Sorry for the delayed response!

My understanding is convergent instrumental goals are goals which are useful to agents which want to achieve a broad variety of utility functions over different states of matter. I'm not sure how the concept applies in other cases.

I'm confused about the "I'm not sure how the concept applies in other cases" part. It seems to me that 'arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer' are a special case of 'agents which want to achieve a broad variety of utility functions over different states of matter'.

Like, if we aren't using RL, and there is no unintended optimization, why specifically would there be pressure to achieve convergent instrumental goals?

I'm not sure how to interpret 'unintended optimization', but I think that a sufficiently broad interpretation would cover the failure modes I'm talking about here.

I'm interested in #1. It seems like the most promising route is to prevent unintended optimization from arising in the first place, instead of trying to outwit a system that's potentially smarter than we are.

I agree. So the following is a pending question that I haven't addressed here yet: Would '(un)supervised learning at arbitrarily large scale' produce arbitrarily capable systems that "want to affect the world"?

I won't address this here, but I think this is a very important question that deserves a thorough examination (I plan to reply here with another comment if I end up writing something about it). For now I'll note that my best guess is that most AI safety researchers think that it's at least plausible (>10%) that the answer to that question is "yes".

I believe that researchers tend to model Oracles as agents that have a utility function that is defined over world states/histories (which would make less sense if they are confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not 'want to affect the world'). Here's some supporting evidence for this:

  • Stuart Armstrong and Xavier O'Rourke wrote in their Safe Uses of AI Oracles paper:

    we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

  • Stuart Russell wrote in his book Human Compatible (2019):

    if the objective of the Oracle AI system is to provide accurate answers to questions in a reasonable amount of time, it will have an incentive to break out of its cage to acquire more computational resources and to control the questioners so that they ask only simple questions.
