All of NicholasKees's Comments + Replies

This avoids spending lots of time getting confused about concepts that are confusing because they were the wrong thing to think about all along, such as "what is the shape of human values?" or "what does GPT4 want?"

These sound like exactly the sort of questions I'm most interested in answering. We live in a world of minds that have values and want things, and we are trying to prevent the creation of a mind that would be extremely dangerous to that world. These kinds of questions feel to me like they tend to ground us in reality.

Try out The Most Dangerous Writing App if you are looking for ways to improve your babble. It forces you to keep writing continuously for a set amount of time, or else the text will fade and you will lose everything. 

First of all, thank you so much for this post! I found it generally very convincing, but there were a few things that felt missing, and I was wondering if you could expand on them.

However, I expect that neither mechanism will produce as much of a relative jump in AI capabilities, as cultural development produced in humans. Neither mechanism would suddenly unleash an optimizer multiple orders of magnitude faster than anything that came before, as was the case when humans transitioned from biological evolution to cultural development.

Why do you expect this? ... (read more)

Are you lost and adrift, looking at the looming danger from AI and wondering how you can help? Are you feeling overwhelmed by the size and complexity of the problem, not sure where to start or what to do next?

I can't promise a lot, but if you reach out to me personally I commit to doing SOMETHING to help you help the world. Furthermore, if you are looking for specific things to do, I also have a long list of projects that need doing and questions that need answering. 

I spent so many years of my life just upskilling, because I thought I needed to be an expert to help. The truth is, there are no experts, and no time to become one. Please don't hesitate to reach out <3

Natural language is more interpretable than the inner processes of large transformers.

There's certainly something here, but it's tricky because this implicitly assumes that the transformer is using natural language in the same way that a human is. I highly recommend these posts if you haven't read them already: 

2 · Peter Hroššo · 2mo
Regarding steganography: there is the natural constraint that the payload (hidden message) must be relatively small with respect to the main message. So this is a natural bottleneck for communication, which should give us a fair advantage over the inscrutable information flows in current large models. On top of that, it seems viable to monitor cases where a so-far benevolent LLM receives a seemingly benevolent message, after which it starts acting maliciously. I think the main argument behind my proposal is that if we limit the domains a particular LLM is trained on, there will be fewer emergent capabilities. I.e. a computer-science specialist may come up with steganographic messaging, but it will be hard to spread this skill/knowledge to specialists in other domains such as biology, chemistry, humanities... And these other specialists won't be able to come up with it by themselves. They might be able to come up with other dangerous things such as bioweapons, but they won't be able to use them against us without coordination and without secure communication, etc.
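A toy sketch of why that bandwidth bottleneck exists (illustrative Python with a made-up synonym-choice scheme; real steganography in LLM outputs could look quite different): if hidden bits can only ride on choices that leave the surface meaning unchanged, capacity is limited to one bit per such choice point, far less than the length of the visible message.

```python
# Hypothetical toy scheme: hide bits in synonym choices. Capacity is one bit per
# choice point, which is far below the length of the visible message itself.
SYNONYM_PAIRS = {
    "big": ("big", "large"),
    "quick": ("quick", "fast"),
    "start": ("start", "begin"),
}

def encode(words, bits):
    """Pick each choice-point word according to the next hidden bit (0 or 1)."""
    bit_iter = iter(bits)
    return [SYNONYM_PAIRS[w][next(bit_iter, 0)] if w in SYNONYM_PAIRS else w
            for w in words]

def decode(words):
    """Recover hidden bits from which synonym was chosen."""
    lookup = {alt: i for pair in SYNONYM_PAIRS.values() for i, alt in enumerate(pair)}
    return [lookup[w] for w in words if w in lookup]

message = "we should start the big experiment with a quick test".split()
stego = encode(message, [1, 0, 1])
print(" ".join(stego))   # reads as an ordinary sentence
print(decode(stego))     # [1, 0, 1] -- only 3 hidden bits in a 10-word message
```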
1 · Peter Hroššo · 2mo
Thanks for the links, will check them out! I'm aware this proposal doesn't address deception, or side-channel communication such as steganography. But being able to understand at least the 1st level of the message, as opposed to the current state of understanding almost nothing from the weights and activations, seems like a major improvement to me.

That's a good point. There are clearly examples of systems where more is better (e.g. blockchain). There are just also other examples where the opposite seems true.

I agree that this is important. Are you more concerned about cyborgs than other human-in-the-loop systems? To me the whole point is figuring out how to make systems where the human remains fully in control (unlike, e.g. delegating to agents), and so answering this "how to say whether a person retains control" question seems critical to doing that successfully.

5 · David Scott Krueger (formerly: capybaralet) · 4mo
Indeed.  I think having a clean, well-understood interface for human/AI interaction seems useful here.  I recognize this is a big ask in the current norms and rules around AI development and deployment.

Thank you for this gorgeously written comment. You really capture the heart of all this so perfectly, and I completely agree with your sentiments.  
 

I think it's really important for everyone to always have a trusted confidant, and to go to them directly with this sort of thing first before doing anything. It is in fact a really tough question, and no one will be good at thinking about this on their own. Also, for situations that might breed a unilateralist's curse type of thing, strongly err on the side of NOT DOING ANYTHING. 

An example I think about a lot is the naturalistic fallacy. There is a lot of horrible suffering that happens in the natural world, and a lot of people seem to be way too comfortable with that. We don't have any really high-leverage options right now to do anything about it, but it strikes me as plausible that even if we could do something about it, we wouldn't want to. (Perhaps we would even make it worse by populating other planets with life: https://www.youtube.com/watch?v=HpcTJW4ur54)

2 · Gerald Monroe · 4mo
It's not just comfort. Institutions have baked-in mechanisms to deflect blame when an event happens that is "natural". So rather than comparing the possible outcomes of their actions and always picking the best one, letting nature happen is OK. Examples: if a patient refuses treatment, letting them die "naturally" is less bad than attempting treatment and having them die during the attempt. Or the NRC protecting the public from radiation leaks with heavy regulation, but not from the radioactives in coal ash that will be released as a consequence of NRC decisions. Or the FDA delaying Moderna after it was ready in one weekend, because it's natural to die of a virus.

I really loved the post! I wish more people took S-risks completely seriously before dismissing them, and you make some really great points. 

In most of your examples, however, it seems the majority of the harm is in an inability to reason about the consequences of our actions, and if humans became smarter and better informed it seems like a lot of this would be ironed out. 

I will say the hospice/euthanasia example really strikes a chord with me, but even there, isn't it more a product of cowardice than a failure of our values?

GI is very efficient, if you consider that you can reuse a lot of the machinery that you learn, rather than needing to relearn it over and over again. https://towardsdatascience.com/what-is-better-one-general-model-or-many-specialized-models-9500d9f8751d

1 · Linda Linsefors · 5mo
Second reply. And this time I actually read the link. I'm not surprised by that result. My original comment was a reaction to claims of the type [the best way to solve almost any task is to develop general intelligence, therefore there is a strong selection pressure to become generally intelligent]. I think this is wrong, but I have not yet figured out exactly what the correct view is. But to use an analogy, it's something like this: In the example you gave, the AI gets better at the sub-tasks by learning on a more general training set. It seems like general capabilities were useful. But suppose we just trained on even more data for a single sub-task; wouldn't it then develop general capabilities, since we just noticed that general capabilities were useful for that sub-task? I was planning to say "no", but I notice that I do expect some transfer learning. I.e. if you train on just one of the datasets, I expect it to be bad at the other ones, but I also expect it to learn them quicker than without any pre-training. I seem to expect that AI will develop general capabilities when training on rich enough data, i.e. almost any real-world data. LLMs are a central example of this. I think my disagreement with at least myself from some years ago, and probably some other people too (but I've been away a bit from the discourse so I'm not sure), is that I don't expect as much agentic long-term planning as I used to expect.
1 · Linda Linsefors · 5mo
I agree that eventually, at some level of trying to solve enough different types of tasks, GI will be efficient in terms of how much machinery you need, but it will never be able to compete on speed. Also, it's an open question what counts as "enough different types of tasks". Obviously, for a sufficiently broad class of problems GI will be more efficient (in the sense clarified above). Equally obviously, for a sufficiently narrow class of problems narrow capabilities will be more efficient. Humans have GI to some extent, but we mostly don't use it. This is interesting. It means that a typical human environment is complex enough that it's worth carrying around the hardware for GI. But even though we have it, it is evolutionarily better to fall back on habits, or imitation, or instinct, for most situations. Looking back at exactly what I wrote, I said there will not be any selection pressure for GI as long as other options are available. I'm not super confident in this. But I'm going to defend it here anyway by pointing out that "as long as other options are available" is doing a lot of the work here. Some problems are only solvable by noticing deep patterns in reality, and in that case a sufficiently deep NN with sufficient training will learn this, and that is GI.
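To put rough numbers on the "reuse of machinery" point in this thread (a back-of-the-envelope sketch with made-up parameter counts, not measurements from any real system): a general model that shares one large backbone across tasks pays its big fixed cost once, while specialists pay per task, so which is cheaper depends on how many tasks need covering.

```python
# Hypothetical parameter counts, chosen only to illustrate the scaling argument.
BACKBONE = 1_000_000_000    # shared general-purpose machinery (paid once)
TASK_HEAD = 10_000_000      # small task-specific component (paid per task)
SPECIALIST = 300_000_000    # a narrow model trained for one task (paid per task)

for n_tasks in (2, 10, 100):
    general = BACKBONE + n_tasks * TASK_HEAD
    specialists = n_tasks * SPECIALIST
    print(f"{n_tasks:>3} tasks: general {general / 1e9:.2f}B params vs "
          f"specialists {specialists / 1e9:.2f}B params")

# With these numbers the specialists win at 2 tasks and the general model wins
# at 10 or 100, matching the point above that narrow beats general for a
# sufficiently narrow class of problems and vice versa for a broad one.
# (Per-task inference speed can still favour the narrow models.)
```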

Sometimes something can be infohazardous even if it's not completely true. Even though the Northwest Passage didn't really exist, it inspired many European expeditions to find it. There's a lot of hype about AI right now, and I think a cool new capabilities idea (even if it turns out not to work well) can also do harm by inspiring people to try similar things.

0 · M. Y. Zuo · 5mo
But even the failed attempts at discovering the Northwest Passage did lead to better mapping of the area, and other benefits, so it's not clear whether it was net negative for society at all.

I interpret the goal as being more about figuring out how to use simulators as powerful tools to assist humans in solving alignment, and not at all shying away from the hard problems of alignment. Despite our lack of understanding of simulators, people (such as yourself) have already found them to be really useful, and I don't think it is unreasonable to expect that as we become less confused about simulators that we learn to use them in really powerful and game-changing ways. 

You gave "Google" as an example. I feel like having access to Google (or another search engine) improves my productivity by more than 100x. This seems like evidence that game-changing tools exist.

and increasing the number of actors can make collusive cooperation more difficult

An empirical counterargument to this is in the incentives human leaders face when overseeing people who might coordinate against them. When authoritarian leaders come into power they will actively purge members from their inner circles in order to keep them small. The larger the inner circle, the harder it becomes to prevent a rebellious individual from gathering the critical mass needed for a full-blown coup.

Source: The Dictator's Handbook by Bruce Bueno de Mesquita and... (read more)

4 · Eric Drexler · 4mo
Are you arguing that increasing the number of (AI) actors cannot make collusive cooperation more difficult? Even in the human case, defectors make large conspiracies more difficult, and in the non-human case, intentional diversity can almost guarantee failures of AI-to-AI alignment.
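A toy way to quantify the "defectors make large conspiracies more difficult" point (an illustrative model with an assumed per-member defection probability, not an empirical estimate): if each member independently stays quiet with probability (1 - p), the chance the whole group holds together falls off exponentially with its size.

```python
# Toy model: n conspirators, each independently defecting/leaking with probability p.
# p = 0.05 is an assumed value, used only to show the exponential falloff.
p = 0.05
for n in (2, 5, 10, 30, 100):
    p_holds = (1 - p) ** n
    print(f"n = {n:>3}: P(no defection) = {p_holds:.3f}")
# With p = 0.05, a 2-person plot holds together about 90% of the time, a
# 30-person plot only about 21%, and a 100-person plot well under 1%.
```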

What is evolution's true goal? If it's genetic fitness, then I don't see how this demonstrates alignment. Human sexuality is still just an imperfect proxy, and doesn't point at the base objective at all. 

I agree that it's very interesting how robust this is to the environment we grow up in, and I would expect there to be valuable lessons here for how value formation happens (and how we can control this process in machines).

To me this statement seems mostly tautological. Something is instrumental if it is helpful in bringing about some kind of outcome. The term "instrumental" is always (as far as I can tell) in reference to some sort of consequence based optimization. 

I agree that this is an important difference, but I think that "surely cannot be adaptive" ignores the power of group selection effects.

2 · tailcalled · 5mo
Group selection effects aren't that strong [https://www.lesswrong.com/posts/QsMJQSFj7WfoTMNgW/the-tragedy-of-group-selectionism].

Wow, this post is fantastic! In particular I love the point you make about goal-directedness:

If a model is goal-directed with respect to some goal, it is because such goal-directed cognition was selected for.

Looking at our algorithms as selection processes that incentivize different types of cognition seems really important and underappreciated. 

Assuming that what evolution 'wants' is child-bearing heterosexual sex, then human sexuality has a large number of deviations from this in practice including homosexuality, asexuality, and various paraphilias.

I don't think this is a safe assumption. Sex also serves a social bonding function beyond procreation, and there are many theories about the potential advantages of non-heterosexual sex from an evolutionary perspective. 

A couple of things you might find interesting:
- Men are 33% more likely to be gay for every older brother they have: https://pubmed.... (read more)

1 · beren · 5mo
Indeed, but insofar as this bonding function enhances IGF, this actually makes it an even more impressive example of alignment to evolution's true goal. I know that there are a bunch of potential evolutionary rationales proposed for homosexuality, but I personally haven't studied them in depth, nor are any super convincing to me, so I'm just assuming the worst-case scenario for evolution here.
2 · tailcalled · 5mo
I think you need to distinguish between homosexuality/asexuality, which compromise on heterosexual interest and thus surely cannot be adaptive, and bisexuality, which doesn't.

It seems to occur mostly without RL. People start wanting to have sex before they have actually had sex.

This doesn't mean that it isn't a byproduct of RL. Something needs to be hardcoded, but a simple reward circuit might lead to a highly complex set of desires and cognitive machinery. I think the things you are pointing to in this post sound extremely related to what Shard Theory is trying to tackle. 
https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values 

2 · beren · 5mo
Indeed, this is exactly the kind of thing I am gesturing at. Certainly, all our repertoires of sexual behaviour are significantly shaped by RL. My point is that, in this case, evolution has somehow mostly solved some pointers-like problem to get the reward model to suddenly include rewards for sexual behaviour, can do so robustly, and can do so long after birth, after a decade or so of unsupervised learning and RL has already occurred. Moreover, this reward model leads to people robustly pursuing this goal even fairly off-distribution from the ancestral environment.

I am also completely against building powerful autonomous agents (albeit for different reasons), but to avoid doing this seems to require extremely high levels of coordination. All it takes is one lab to build a singleton capable of disempowering humanity. It would be great to stay in the "tool AI" regime for as long as possible, but how?

On many useful cognitive tasks (chess, theoretical research, invention, mathematics, etc.), beginner/dumb/unskilled humans are closer to a chimpanzee/rock than peak humans (for some fields, only a small minority of humans are able to perform the task at all, or perform the task in a useful manner).

This seems to be due to the fact that most tasks are "all or nothing", or at least have a really steep learning curve. I don't think that humans differ that much in intelligence, but rather that small differences result in hugely different abilities. This is part of why I expect foom: small improvements to an AI's cognition seem likely to deliver massive payoffs in terms of its ability to affect the world.
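As a toy picture of "small differences, hugely different abilities" (an illustrative model with an arbitrary steepness parameter, not a claim about real capability curves): if success on an all-or-nothing task is a very steep logistic function of underlying capability, a small capability gap turns into a huge gap in what actually gets accomplished.

```python
import math

def success_rate(capability, threshold=0.5, steepness=50):
    """Toy all-or-nothing task: success is logistic in capability.
    The threshold and steepness are arbitrary illustrative values."""
    return 1 / (1 + math.exp(-steepness * (capability - threshold)))

for c in (0.45, 0.49, 0.51, 0.55):
    print(f"capability {c:.2f} -> success rate {success_rate(c):.2f}")
# A 0.45 vs 0.55 capability gap moves the success rate from roughly 8% to 92%
# on this toy task, even though the underlying difference is small.
```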
 

If we are able to flag a treacherous turn as cognitively anomalous, then we can take that opportunity to shut down a system and retrain on the offending datapoint.

What do you mean by "retrain on the offending datapoint"? I would be worried about Goodharting on this by selecting for systems which don't set off the anomaly detector, thereby making it a less reliable safeguard.

5 · paulfchristiano · 6mo
Suppose I have a slow process I trust that I use to provide sparse ground truth for my system (like a very extensive human evaluation). But day-to-day I need to use my ML system because it's much cheaper. I'm concerned that it may take some catastrophically bad actions at test time because it thinks that it can take over. But if I can flag those as anomalous, then I can invoke my slow oversight process, include the datapoint in training data, update my model to be less likely to try to take a treacherous turn, and then continue. If my model learns quickly then I won't have to do this very many times before it stops trying to take a treacherous turn.
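A minimal sketch of the loop being described (the component interfaces below are hypothetical placeholders, not anything from an existing codebase): run the cheap model day-to-day, and only invoke the slow trusted process when a decision is flagged as anomalous, folding that datapoint back into training.

```python
def run_with_anomaly_gate(model, detector, slow_oversight, stream):
    """Sketch of the described loop. All objects are hypothetical:
    model.act(x) / model.update(pairs), detector.is_anomalous(x, action),
    and slow_oversight.evaluate(x, action) -> trusted action."""
    for x in stream:
        action = model.act(x)
        if detector.is_anomalous(x, action):
            # Possible treacherous turn: don't execute the flagged action.
            # Invoke the slow, trusted oversight process instead, add the
            # datapoint to training data, and act on the trusted judgement.
            trusted_action = slow_oversight.evaluate(x, action)
            model.update([(x, trusted_action)])
            action = trusted_action
        yield action
```

If the model learns quickly, the gate should only fire a handful of times before treacherous attempts stop being generated, which is the hope expressed above.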

I really enjoyed this post. No assumptions are made about the moral value of insects; rather, the author just points out how little we ever thought about it in the first place. Given that, as a species, we already tend to ignore a lot of atrocities that form a part of our daily lives, if it WERE true, beyond a reasonable doubt, that washing our sheets killed thousands of sentient creatures, I still can't imagine we'd put in a significant effort to find an alternative. (And it certainly wouldn't be socially acceptable to have stinky sheets!) I think... (read more)

A monarch is an unincentivized incentivizer. He actually has the god’s-eye-view and is outside of and above every system. He has permanently won all competitions and is not competing for anything, and therefore he is perfectly free of Moloch and of the incentives that would otherwise channel his incentives into predetermined paths. Aside from a few very theoretical proposals like my Shining Garden, monarchy is the only system that does this.

It seems to me that a monarch is far from outside every system, and is highly dependent on their key supporters (gene... (read more)