Loppukilpailija

Comments

This survey is really good!

Speaking as someone who's exploring the AI governance landscape: I found the list of intermediate goals, together with the responses, a valuable compilation of ideas. In particular it made me appreciate how large the surface area is (in stark contrast to takes on how progress in technical AI alignment doesn't scale). I would definitely recommend this to people new to AI governance.

For coordination purposes, I think it would be useful for those who plan on submitting a response to mark that they'll do so, and perhaps say a little about the contents of their response. It would also be useful for those who don't plan on responding to explain why not.

The last paragraph stood out to me (emphasis mine).

Second, we believe it would be unintuitively risky and difficult to stop the creation of superintelligence. Because the upsides are so tremendous, the cost to build it decreases each year, the number of actors building it is rapidly increasing, and it’s inherently part of the technological path we are on, stopping it would require something like a global surveillance regime, and even that isn’t guaranteed to work. So we have to get it right.

There are efforts in AI governance that definitely don't look like a "global surveillance regime"! Taking the part above at face value, the authors seem to think that such efforts are not sufficient. But earlier in the post they talk about useful things that one could do in the AI governance field (lab coordination, an independent IAEA-like authority), so I'm left confused about the authors' models of what's feasible and what's not.

The passage also makes me worried that the authors are, despite their encouragement of coordination and audits, skeptical of or even opposed to efforts to stop building dangerous AIs. (Perhaps this should have already been obvious from OpenAI pushing the capabilities frontier, but anyways.)

Regarding betting odds: are you aware of this post? It gives a betting algorithm that satisfies both of the following conditions:

  • Honesty: participants maximize their expected value by reporting their probabilities honestly.
  • Fairness: participants' (subjective) expected values are equal.

The solution is "the 'loser' pays the 'winner' the difference of their Brier scores, multiplied by some pre-determined constant C". This constant C puts an upper bound on the amount of money you can lose. (Ideally C should be fixed before bettors give their odds, because otherwise the honesty desideratum above could break, but I don't think that's a problem here.)
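To make the rule concrete, here is a minimal sketch, assuming a single binary event and the usual Brier score (p - outcome)^2, where lower is better. The function names (`brier`, `payment_to_a`, `expected_payment_to_a`) and the numbers in the usage part are made up for illustration and are not taken from the linked post.

```python
def brier(p, outcome):
    """Brier score of probability p for a binary outcome (1 or 0); lower is better."""
    return (p - outcome) ** 2


def payment_to_a(p_a, p_b, outcome, c):
    """Amount B pays A (negative means A pays B): C times the Brier score difference."""
    return c * (brier(p_b, outcome) - brier(p_a, outcome))


def expected_payment_to_a(belief, p_a, p_b, c):
    """Expected payment to A, evaluated under the probability `belief` that the event happens."""
    return (belief * payment_to_a(p_a, p_b, 1, c)
            + (1 - belief) * payment_to_a(p_a, p_b, 0, c))


# Hypothetical bet: A reports 0.8, B reports 0.4, and C = 100 was fixed in advance.
P_A, P_B, C = 0.8, 0.4, 100.0

# Fairness: each bettor's subjective expected value, computed with their own belief.
ev_a = expected_payment_to_a(P_A, P_A, P_B, C)
ev_b = -expected_payment_to_a(P_B, P_A, P_B, C)

# Honesty: holding B's report fixed, scan A's possible reports and find the best one
# from the point of view of A's true belief.
reports = [i / 100 for i in range(101)]
best_report_for_a = max(reports, key=lambda r: expected_payment_to_a(P_A, r, P_B, C))

print(ev_a, ev_b, best_report_for_a)
```

Under these assumptions, both subjective expected values work out to C * (p_A - p_B)^2, and since each Brier score lies between 0 and 1, nobody can lose more than C.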

On premature advice

Here's a pattern I've recognized - all examples are based on real events.

Scenario 1. Starting to exercise

Alice: "I've just started working out again. I've been doing blah for X minutes and then blah blah for Y minutes."

Bob: "You shouldn't exercise like that, you'll injure yourself. Here's what you should be doing instead..."

Result: Alice stops exercising.

Scenario 2. Starting to invest

Alice: "Everyone around me tells me that investing is a good idea, so I'm now going to invest in index funds."

Bob: "You better know what you are doing. Don't invest any money you cannot afford to lose, Past Performance Is No Guarantee of Future Results, also [speculation] so this might not be the best time to invest, also..."

Result: Alice doesn't invest any of her money anywhere.

Scenario 3. Buying lighting

Alice: "My current lighting is quite dim, I'm planning on buying more and better lamps."

Bob: "Lighting is complicated: you have to look at color temperature and color rendering index, make sure to have shades, also ideally you have colder lighting in the morning and warmer in the evening, and..."

Result: Alice doesn't improve her lighting.


I think this pattern, namely overwhelming a beginner with technical, nuanced advice (that possibly was not even asked for), is bad, and Bobs shouldn't do that.

An obvious improvement is to not be as discouraging as Bob in the examples above, but it's still tricky to actually make things better instead of demotivating Alice.

When I'm Alice, I often just want to share something I've been thinking about recently, and maybe get some encouragement. Hearing Bob tell me how much I don't know doesn't make me go learn about the topic (that's a fabricated option); it discourages me and may make me give up.

My memories of being Bob are not as easily accessible, but I can guess what it's like. Probably it's "yay, Alice is thinking about something I know about, I can help her!", sliding into "it's fun to talk about subjects I know about" all the way to "you fool, look how much more I know than you". 

What I think Bob should do, and what I'll do when encountering an Alice, is to be more supportive and perhaps encourage them to talk more about the thing they seem to want to talk about. 

I feel like the post proves too much: it gives arguments for why foom is unlikely, but I don't see arguments which break the symmetry between "humans cannot foom relative to other animals" and "AI cannot foom relative to humans".* For example, the statements

brains are already reasonably pareto-efficient 

and

Intelligence requires/consumes compute in predictable ways, and progress is largely smooth.

seem irrelevant or false in light of the human-chimp example. (Are animal brains pareto-efficient? If not, I'm interested in what breaks the symmetry between humans and other animals. If yes, pareto-efficiency doesn't seem that useful for making predictions on capabilities/foom.)

*One way to resolve the situation is by denying that humans foomed (in a sense relevant for AI), but this is not the route taken in the post.



Separately, I disagree with many claims and the overall thrust in the discussion of AlphaZero.

Go is extremely simple [...] This means that the Go predictive capability of a NN model as a function of NN size completely flatlines at an extremely small size.

This seems unlikely to me, depending on what "completely flatlines" and "extremely small size" mean.

Games like Go or chess are far too small for a vast NN like the brain, so the vast bulk of its great computational power is wasted.

Go and chess being small/simple doesn't seem like the reason why ANNs are way better than brains there. Or, if it is, we should see the difference between ANNs and brains shrinking as the environment gets larger/more complex. This model doesn't seem to lead to good predictions, though: Dota 2 is a lot more complicated than Go and chess, and yet we have superhuman performance there. Or how complicated exactly does a task need to be before ANNs and brains are equally good?

(Perhaps relatedly: There seems to be an implicit assumption that AGI will be an LLM. "The AGI we actually have simply reproduces [cognitive biases], because we train AI on human thoughts". This is not obvious to me - what happened to RL?)

On a higher level, the whole train of reasoning reads like a just-so story to me: "We have obtained superhuman performance in Go, but this is only because of training on vastly more data and the environment being simple. As the task gets more complicated the brain becomes more competitive. And indeed, LLMs are close to but not quite human intelligences!". I don't see this as a particularly good fit to the datapoints, or how this hypothesis is likelier than "There is room above human capabilities in ~every task, and we have achieved superhuman abilities in some tasks but not others (yet)".

My thoughts on the "Humans vs. chimps" section (which I found confusing/unconvincing):

Chimpanzees have brains only ~3x smaller than humans, but are much worse at making technology (or doing science, or accumulating culture…). If evolution were selecting primarily or in large part for technological aptitude, then the difference between chimps and humans would suggest that tripling compute and doing a tiny bit of additional fine-tuning can radically expand power, undermining the continuous change story.

But chimp evolution is not primarily selecting for making and using technology, for doing science, or for facilitating cultural accumulation.

For me, the main takeaway of the human vs. chimp story is information about the structure of mind space, namely that there are discontinuities in terms of real-world consequences.

Evolution changes continuously on the narrow metric it is optimizing, but can change extremely rapidly on other metrics. For human technology, features of the technology that aren’t being optimized change rapidly all the time. When humans build AI, they will be optimizing for usefulness, and so progress in usefulness is much more likely to be linear.

I don't see how "humans are optimizing AI systems for usefulness" undermines the point about mind space - if there are discontinuities in capabilities / resulting consequences, I don't see how optimizing for capabilities / consequences makes things any more continuous. 

Also, there is a difference between "usefulness" and (say) "capability of causing human extinction", just as there is a difference between "inclusive genetic fitness" and "intelligence". Cf. it being hard to get LLMs to do what you want them to do, and the difference between the publicity* of ChatGPT and other GPT-3 models being more about usability and UI than about the underlying capabilities.

*Publicity is a different thing from usefulness. Lacking a more narrow definition of usefulness, I still would argue that to many people ChatGPT is more useful than other GPT models.

Our planet is full of groups of power-seekers competing against each other. Each one of them could cooperate (join in the moratorium) defect (publicly refuse) or stealth-defect (proclaim that they're cooperating while stealthily defecting). The call for a moratorium amounts to saying to every one of those groups "you should choose to lose power relative to those who stealth-defect". It doesn't take much decision theory to predict that the result will be a covert arms race conducted in a climate of fear by the most secretive and paranoid among the power groups.

 

There seems to be an underlying assumption that the number of stealth-defecting AI labs doing GPT-4-level training runs is non-zero. This is a non-trivial claim and I'm not sure I agree. My impression is that there are few AI labs world-wide that are capable of training such models in the next 6-12 months and we more or less know what they are.

I also disagree with the framing of stealth-defection as a relatively trivial operation which is better than cooperation, mostly because training such models takes a lot of people (just look at pages 15-17 in the GPT-4 paper!) and thus the probability of someone whistleblowing is large.

And for what it's worth, I would really have hoped that such things are discussed in a post that starts with a phrase of the form "All the smart people [...] seem to have unaccountably lost their ability to do elementary game theory".

Inspired by the "reward chisels cognition into the agent's network" framing from Reward is not the optimization target, I thought: is reward necessarily a fine enough tool? More elaborately: if you want the model to behave in a specific way or to have certain internal properties, can you achieve this simply by a suitable choice of the reward function?

I looked at two toy cases, namely Q-learning and training a neural network (the latter of which is not actually reinforcement learning but supervised learning). The answers were "yep, suitable reward/loss (and datapoints in the case of supervised learning) are enough".

I was hoping for this not to be the case, as that would have been more interesting (imagine if there were fundamental limitations to the reward/loss paradigm!), but anyways. I now expect that also in more complicated situations reward/loss are, in principle, enough.


Example 1: Q-learning. You have a set $S$ of states and a set $A$ of actions. Given a target policy $\pi : S \to A$, can you necessarily choose a reward function $R : S \times A \to \mathbb{R}$ such that, training for long enough* with Q-learning (with positive learning rate and discount factor), the action that maximizes reward is the one given by the target policy: $\arg\max_{a \in A} Q(s, a) = \pi(s)$ for all $s \in S$?

*and assuming we visit all of the states in $S$ many times

The answer is yes. Simply reward the behavior you want to see: let $R(s, a) = 1$ if $a = \pi(s)$ and $R(s, a) = 0$ otherwise.

(In fact, one can more strongly choose, for any target value function $Q' : S \times A \to \mathbb{R}$, a reward function $R$ such that the values $Q(s, a)$ in Q-learning converge in the limit to $Q'(s, a)$. So not only can you force certain behavior out of the model, you can also choose the internals.)
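The claim in Example 1 can be checked numerically on a toy problem. The following is a minimal sketch under assumptions of my own choosing: five states, three actions, a deterministic made-up transition rule `step`, a hand-picked `TARGET_POLICY`, and state-action pairs sampled uniformly at random so that every pair is updated many times. The reward is 1 when the action matches the target policy and 0 otherwise.

```python
import random

N_STATES, N_ACTIONS = 5, 3
TARGET_POLICY = [2, 0, 1, 1, 0]  # hypothetical target policy: state -> action


def step(state, action):
    """Made-up deterministic transition rule, used only for illustration."""
    return (state + action + 1) % N_STATES


def reward(state, action):
    """Reward 1 for the action the target policy prescribes, 0 otherwise."""
    return 1.0 if action == TARGET_POLICY[state] else 0.0


def q_learning(alpha=0.5, gamma=0.9, n_updates=20000, seed=0):
    """Tabular Q-learning on uniformly sampled (state, action) pairs."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_updates):
        s = rng.randrange(N_STATES)
        a = rng.randrange(N_ACTIONS)
        s_next = step(s, a)
        # Standard Q-learning update towards the bootstrapped target.
        td_target = reward(s, a) + gamma * max(q[s_next])
        q[s][a] += alpha * (td_target - q[s][a])
    return q


q = q_learning()
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES)]
print(greedy == TARGET_POLICY)
```

In this setup the optimal value of every state is 1 / (1 - gamma), so the learned Q-values should approach 10 for the target action and 9 for every other action, which is why the greedy policy matches the target.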

Example 2: Neural network.

Say you have a neural network $f$ with $n$ tunable weights $w = (w_1, \ldots, w_n)$. Can you, by suitable input-output pairs and choices of the learning rate, modify the weights of the net so that they are (approximately) equal to $w' = (w'_1, \ldots, w'_n)$?

(I'm assuming here that we simply update the weights after each data point, instead of doing SGD or something. The choice of loss function is not very relevant, take e.g. square-error.)

The following sketch convinces me that the answer is positive:

Choose $n$ random input-output pairs $(x_i, y_i)$. The gradients $g_i$ of the loss with respect to the weight vector are almost certainly linearly independent. Hence, some linear combination $c_1 g_1 + \ldots + c_n g_n$ of them equals $w - w'$. Now, for small $\epsilon > 0$, running back-propagation on the pair $(x_i, y_i)$ with learning rate $\epsilon c_i$ for all $i = 1, \ldots, n$ gives you an update approximately in the direction of $w' - w$. Rinse and repeat.
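Here is a minimal sketch of the argument above on a four-weight toy network, written in plain Python. Several things are assumptions for illustration: the network shape `w[2] * tanh(w[0] * x + w[1]) + w[3]`, the fixed (rather than random) inputs, the choice of each target output as one above the current output, the helper names, and the fact that the per-pair learning rates `eps * c_i` are allowed to be negative.

```python
import math


def forward(w, x):
    """Toy network f(x) = w[2] * tanh(w[0] * x + w[1]) + w[3]; returns output and hidden value."""
    h = math.tanh(w[0] * x + w[1])
    return w[2] * h + w[3], h


def grad(w, x, y):
    """Back-propagated gradient of the squared error (f(x) - y)**2 with respect to w."""
    out, h = forward(w, x)
    d_out = 2.0 * (out - y)
    d_pre = d_out * w[2] * (1.0 - h * h)
    return [d_pre * x, d_pre, d_out * h, d_out]


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z


def steer(w, target, xs, eps=0.01, rounds=2000):
    """Repeatedly nudge w towards target using only per-example gradient steps."""
    w = list(w)
    n = len(w)
    for _ in range(rounds):
        # Assumed choice of outputs: each y_i sits 1 above the current network output.
        ys = [forward(w, x)[0] + 1.0 for x in xs]
        grads = [grad(w, x, y) for x, y in zip(xs, ys)]
        # Find c with sum_i c_i * g_i = w - target (column i of the matrix is g_i).
        a = [[grads[i][j] for i in range(n)] for j in range(n)]
        c = solve(a, [w[j] - target[j] for j in range(n)])
        # One plain gradient step per pair, with (possibly negative) learning rate eps * c_i.
        for x, y, c_i in zip(xs, ys, c):
            g = grad(w, x, y)
            w = [w_j - eps * c_i * g_j for w_j, g_j in zip(w, g)]
    return w


start = [1.0, 0.0, 1.0, 0.0]
target = [1.5, 0.3, 0.7, -0.2]
final = steer(start, target, xs=[-1.5, -0.5, 0.5, 1.5])
print(max(abs(a - b) for a, b in zip(final, target)))
```

Each round moves the weights roughly a fraction `eps` of the way towards the target, so the distance shrinks geometrically as long as the gradients stay linearly independent along the way, which is not guaranteed in general.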

Feature suggestion: Allow one to sort a user's comments by the number of votes.

Context: I saw a comment by Paul Christiano, and realized that probably a significant portion of the views expressed by a person lie in comments, not top-level posts. However, many people (such as Christiano) have written a lot of comments, so sorting them would allow one to find more valuable comments more easily.
