All of ThomasW's Comments + Replies

We weren't intending to use the contest to do any direct outreach to anyone (not sure how one would do direct outreach with one liners in any case) and we didn't use it for that. I think it was less useful than I would have hoped (nearly all submissions were not very good), but ideas/anecdotes surfaced have been used in various places and as inspiration.

It is also interesting to note that the contest was very controversial on LW, essentially due to it being too political/advocacy flavored (though it wasn't intended for "political" outreach per se). I think... (read more)

2trevor7mo
Generally, it's pretty reasonable to think that there's an optimal combination of words that can prepare people to handle the reality of the situation. But that contest was the first try at crowdsourcing it, and there were invisible helicopter blades, e.g. the ideal of "one-liners" steering people in bad directions, anti-outreach norms causing controversy that repelled talented writers, and possibly intervention from bots/hackers, since contests like these might have already been widely recognized as persuasion generators that crowd out the established elites in a very visible way, and the AI alignment community could not reasonably have been expected to anticipate that. When it comes to outputting optimal combinations of words, I think Critch's recent Twitter post went much further (but it's very humanly possible to make an even more optimal combination than this).
2NicholasKross7mo
Full agree on all of this.

It is possible that AI would allow for the creation of brain-computer interfaces such that we can productively merge with AI systems. I don't think this would apply in that case since that would be a true "augmentation."

If that doesn't happen, though, or before that happens, I think this is a real possibility. The disanalogy is that our brains wouldn't add anything to sufficiently advanced AI systems, unlike books, which are useless without our brains to read them.

Today, many people are weaker physically than in previous times because we don't need to do a... (read more)

1Logan Zoellner8mo
  Being human is intrinsically valuable. For certain tasks, AI simply cannot replace us. Many people enjoy watching Magnus Carlsen play chess even though a $5 Raspberry Pi computer is better at chess than him. Similarly, there are more horses in the USA today than in the 1930s when the Model T was introduced. I haven't been able to find a definitive source, but I would be willing to bet that a typical "gym bro" is physically stronger than a typical hunter-gatherer due to better diet/training.

Hi Jan, I appreciate your feedback.

I've been helping out with this and I can say that the organizers are working as quickly as possible to verify and publish new signatures. New signatures have been published since the launch, and additional signatures will continue to be published as they are verified. A team of people is working on it right now and has been since launch.

The main obstacles to extremely swift publication are:

  • First, determining who meets our bar for name publication. We think the letter will have greater authority (and coordination va
... (read more)
5Jan_Kulveit9mo
Thanks for the reply. Also for the work - it's great that signatures are being added. Earlier I checked the bottom of the list and it seemed either the same or with very few additions. I do understand that verifying signatures requires some amount of work. In my view, having more people (could be volunteers) to process the initial expected surge of signatures quickly would have been better; attention spent on this will drop fast.

I've been collecting examples of this kind of thing for a while now here: ai-improving-ai.safe.ai.

In addition to algorithmic and data improvements I'll add there are also some examples of AI helping to design hardware (e.g. GPU architectures) and auxiliary software (e.g. for datacenter cooling).

1Diabloto9610mo
Re: The website: It'd be really great if we could control the number of shown items in the table. Being stuck at 10 is... cumbersome.

At the time of this post, the FLI letter has been signed by 1 OpenAI research scientist, 7 DeepMind research scientists/engineers, and 0 Anthropic employees. 

"1 OpenAI research scientist" felt weird to me on priors. 0 makes sense, if the company gave some guidance (e.g. legal) to not sign, or if the unanimous opinion was that it's a bad idea to sign. 7 makes sense too -- it's about what I'd expect from DeepMind and shows that there's a small contingent of people really worried about risk. Exactly 1 is really weird -- there are definitely multiple risk... (read more)

2Evan R. Murphy10mo
There are actually 3 signatories now claiming to work for OpenAI.

Later in the thread Jan asks, "is this interpretability complete?" which I think implies that his intuition is that this should be easier to figure out than other questions (perhaps because it seems so simple). But yeah, it's kind of unclear why he is calling out this in particular.

I find myself surprised/confused at his apparent surprise/confusion.

Jan doesn't indicate that he's extremely surprised or confused? He just said he doesn't know why this happens. There's a difference between being unsurprised by something (e.g. by observing something similar before) and actually knowing why it happens. To give a trivial example, hunter-gatherers from 10,000 BC would not have been surprised if a lightning strike caused fire, but would be quite clueless (or incorrect) as to why or how this happens.

I think Quintin's answer is a good possible hypothesis (though of course it leads to the further question of how LLMs learn language-neutral circuitry).

3DragonGod1y
I endorse Habryka's response. Like, if you take it as a given that InstructGPT competently responds in other languages when the prompt warrants it, then I just don't think there's anything special about following instructions in other languages that merits special explanation? And following instructions in other languages was singled out as a task that merited special explanation.

I do think that we don't really understand anything in a language model at this level of detail. Like, how does a language model count? How does a language model think about the weather? How does a language model do ROT13 encoding? How does a language model answer physics questions? We don't have an answer to any of these at any acceptable level of detail, so why would we single out our confusion about the language agnosticity? Even if we predicted it, we just don't really understand the details, the same way we don't really understand the details of anything going on in a large language model.

In addition to the post you linked, there is also an earlier post on this topic that I like.

I also co-wrote a post that looks at specific structural factors related to AI safety.

Thanks so much for writing this, quite useful to see your perspective!

First, I don't think that you've added anything new to the conversation. Second, I don't think what you have mentioned even provides a useful summary of the current state of the conversation: it is neither comprehensive, nor the strongest version of various arguments already made.

Fair enough!

I don't think that's a popular opinion here. And while I think some people might just have a cluster of "brain/thinky" words in their head when they don't think about the meaning of things closely, I

... (read more)

I agree with this. If we are able to design consciousness such that a system is fulfilled by serving humans, then it's possible that would be morally alright. I don't think there is a strong enough consensus that I'd feel comfortable locking it in, but to me it seems ok.

By default though, I think we won't be designing consciousness intentionally, and it will just emerge, and I don't think that's too likely to lead to this sort of situation.

A related post: https://www.lesswrong.com/posts/xhD6SHAAE9ghKZ9HS/safetywashing

[Realized this is contained in a footnote, but leaving this comment here in case anyone missed it].

It looks like I got one or possibly two strong downvotes, but it doesn't seem like from either of the commenters. If you downvoted this (or think you understand why it was downvoted), please let me know in the comments so I can improve!

4kyleherndon1y
(This critique contains not only my own critiques, but also critiques I would expect others on this site to have.) First, I don't think that you've added anything new to the conversation. Second, I don't think what you have mentioned even provides a useful summary of the current state of the conversation: it is neither comprehensive, nor the strongest version of various arguments already made. Also, I would prefer to see less of this sort of content on LessWrong. Part of that might be because it is written for a general audience, and LessWrong is not very like the general audience. This is an example of something that seems to push the conversation forward slightly, by collecting all the evidence for a particular argument and by reframing the problem as different, specific, answerable questions. While I don't think this actually "solves the hard problem of consciousness," as Halberstadt notes in the comments, I think it could help clear up some confusions for you. Namely, I think it is most meaningful to start from a vaguely panpsychist model of "everything is conscious," where what we mean by consciousness is "the feeling of what it is like to be," and then move on to talk about what sorts of consciousness we care about: namely consciousness that looks remotely similar to ours. In this framework, AI is already conscious, but I don't think there's any reason to care about that. More specifics: I don't think that's a popular opinion here. And while I think some people might just have a cluster of "brain/thinky" words in their head when they don't think about the meaning of things closely, I don't think this is a popular opinion of people in general unless they're really not thinking about it. Citation needed. Assuming we make an AI conscious, and that consciousness is actually something like what we mean by it more colloquially (human-like, not just panpsychistly), it isn't clear that this makes it a moral concern. I think there shouldn't. At least not yet. The averag

Consciousness therefore only happens if it improves performance at the task we have assigned. And for some tasks, like interacting directly with humans, it might improve performance.

I don't think this is necessarily true. Consciousness could be a side effect of other processes that do improve performance.

The way I've heard this put: a polar bear has thick hair so that it doesn't get too cold, and this is good for its evolutionary fitness. The fact that the hair is extremely heavy is simply a side effect of this. Consciousness could possibly be similar.

1Gerald Monroe1y
I checked and what I am proposing is called a "Markov Blanket". It makes consciousness and all the other failures of the same category unlikely. Not impossible, but it may in practice make them unlikely enough they will never happen. It's simple: we as humans determine exactly what the system stores in between ticks. As all the storage bits will be for things the machine must know in order to do its role, and there are no extra bits, consciousness and deception are unlikely. Consciousness is a subjective experience, meaning you must have memory to actually reflect on the whole "I think, therefore I am." If you have no internal memory, you can't have an internal narrative. All your bits are dedicated to tracking which shelf in the warehouse you are trying to reach and the parameters of the item you are carrying, as an example. It also makes deception difficult, maybe impossible. At a minimum, to have a plan to deceive, it means that when you get input "A", when it's not time to do your evil plans, you do approved action X. You have a bit "TrueColors" that gets set when some conditions are met ("everything is in place"), and when the bit is set, on input "A", you are going to do evil action Y. Deception of all types is this: you're doing something else on the same input. Obviously, if there are no spaces in memory to know when to do a bad thing, when you get A you have no choice but to do the good thing. Even stochastically deciding to do bad will probably get caught in training.
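The "no bits between ticks" idea above can be sketched as a policy that is a pure function of the current observation. This is a hypothetical Python toy (all names, including the warehouse task, are invented for illustration, not taken from any real system):

```python
def stateless_policy(observation: dict) -> str:
    """Map the current observation directly to an action.

    Nothing is stored between calls, so any plan that depends on
    remembering earlier inputs (e.g. waiting until "everything is
    in place" before defecting) has nowhere to live.
    """
    target_shelf = observation["target_shelf"]
    current_shelf = observation["current_shelf"]
    if current_shelf < target_shelf:
        return "move_up"
    if current_shelf > target_shelf:
        return "move_down"
    return "place_item"


# Each call is independent; identical inputs always yield identical actions.
print(stateless_policy({"target_shelf": 3, "current_shelf": 1}))  # move_up
print(stateless_policy({"target_shelf": 3, "current_shelf": 3}))  # place_item
```

Because the function holds no state, a "TrueColors" flag that flips behavior once conditions are met simply has nowhere to persist between ticks.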

I don't think these are necessarily bad suggestions if there were a future series. But my sense is that John did this for the people in the audience, somebody asked him to record it so he did, and now he's putting them online in case they're useful to anyone. It's very hard to make good production quality lectures, and it would have required more effort. But it sounds like John knew this and decided he would rather spend his time elsewhere, which is completely his choice to make. As written, these suggestions feel a bit pushy to me.

2Caridorc Tergilti1y
Yes, the tone of my comment could be improved. I appreciate him for publishing his lessons to the community and wanted to give some suggestions to improve possible future ones, if he feels like the higher quality is worth the higher effort, and with no obligation. "Al caval donato non si guarda in bocca" (Don't look a gift horse in the mouth, i.e. don't judge its age by its teeth).

Sorry if I missed it earlier in the thread, but who is this "polymath"?

2Remmelt1y
Forrest Landry. From Math Expectations, a depersonalised post Forrest wrote of his impressions of a conversation with a grant investigator, where the grant investigator kept looping back on the expectation that a "proof" based on formal reasoning must be written in mathematical notation. We did end up receiving the $170K grant. I usually do not mention Forrest Landry's name immediately for two reasons: 1. If you google his name, he comes across like a spiritual hippie. Geeks who don't understand his use of language take that as a cue that he must not know anything about computational science, mathematics or physics (wrong – Forrest has deep insights into programming methods and e.g. why Bell's Theorem is a thing). 2. Forrest prefers to work on the frontiers of research, rather than repeating himself in long conversations with tech people who cannot let go of their own mental models and quickly jump to motivated counterarguments that he has heard and addressed many times before. So I act as a bridge-builder, trying to translate between Forrest speak and Alignment Forum speak. 1. Both of us prefer to work behind the scenes. I've only recently started to touch on the arguments in public. 2. You can find those arguments elaborated on here. Warning: large inferential distance; do message clarifying questions – I'm game!
2Ben Pace1y
Thread success!

I just do not think that the post is written for people who think "slowing down AI capabilities is robustly good." If people thought that, then why do they need this post? Surely they don't need somebody to tell them to think about it?

So it seems to me like the best audience for this post would be those (including those at some AI companies, or those involved in policy, which includes people reading this post) who currently think something else, for example that the robustly good thing is for their chosen group to be ahead so that they can execute whatever... (read more)

2Ben Pace1y
I see. You're not saying "staffers of the US government broadly won't find this argument persuasive", you're saying "there are some people in the AI x-risk ecosystem who don't think slowing down is robustly good, and won't find this particular argument persuasive". I have less of a disagreement with that sentence.  I'll add that: * I think most of the arguments in the post are relevant to those people, and Katja only says that these moods are "playing a role" which does not mean all people agree with them. * You write "If people thought that, then why do they need this post? Surely they don't need somebody to tell them to think about it?". Sometimes people need help noticing the implications of their beliefs, due to all sorts of motivated cognitions. I don't think the post relies on that and it shouldn't be the primary argument, but I think it's honestly helpful for some people (and was a bit helpful for me to read it).

The claim being made is something like the following:

1) AGI is a dangerous technology.

2) It is robustly good to slow down dangerous technologies.

3) Some people might say that you should not actually do this because of [complicated unintelligible reason].

4) But you should just do the thing that is more robustly good.

I argue that many people (yes, you're right, in ways that conflict with one another) believe the following:

1) X is a dangerous country.

2) It is robustly good to always be ahead of X in all technologies, including dangerous ones.

3) Some people mi... (read more)

2Ben Pace1y
This is kind of a strange comment to me. The argument, and indeed the whole post, is clearly written to people in the ecosystem ("my impression is that for people worried about extinction risk from artificial intelligence, strategies under the heading ‘actively slow down AI progress’ have historically been dismissed and ignored"), for which differential technological progress is a pretty common concept and relied upon in lots of arguments. It's pretty clear that this post is written to point out an undervalued position to those people. Sometimes I feel like people in the AI x-risk ecosystem who interface with policy and DC replace their epistemologies with a copy of the epistemology they find in various parts of the policy-control machine in DC, in order to better predict them and perform the correct signals — asking themselves what people in DC would think, rather than what they themselves would think. I don't know why you think this post was aimed at those people, or why you point out that the post is making false inferences about its audience when the post is pretty clear that its primary audience is the people directly in the ecosystem ("The conversation near me over the years has felt a bit like this").

There are things that are robustly good in the world, and things that are good on highly specific inside-view models and terrible if those models are wrong. Slowing dangerous tech development seems like the former, whereas forwarding arms races for dangerous tech between world superpowers seems more like the latter.

It may seem the opposite to some people. For instance, my impression is that for many adjacent to the US government, "being ahead of China in every technology" would be widely considered robustly good, and nobody would question you at all if you... (read more)

4Ben Pace1y
I'm confused; of course the people in government in every country think that they should have more global power, but this doesn't seem like something everyone (i.e. including people in all of the other countries) would agree is robustly good, and I don't think you should think so either (for any country, be it Saudi Arabia, France, or South Korea). I am not aware of a coherent perspective that says "slowing down dangerous tech development" is not robustly good in most situations (conditional on our civilization's inability to "put black balls back into the urn", a la Bostrom). Your argument sounds to me like "A small group with a lot of political power disagrees with your claim, therefore it cannot be accepted as true." Care to make a better argument?

I think this is all true, but also since Yale CS is ranked poorly the graduate students are not very strong for the most part. You certainly have less competition for them if you are a professor, but my impression is few top graduate students want to go to Yale. In fact, my general impression is often the undergraduates are stronger researchers than the graduate students (and then they go on to PhDs at higher ranked places than Yale).

Yale is working on strengthening its CS department and it certainly has a lot of money to do that. But there are a lot of re... (read more)

I'll just comment on my experience as an undergrad at Yale in case it's useful.

At Yale, the CS department, particularly when it comes to state of the art ML, is not very strong. There are a few professors who do good work, but Yale is much stronger in social robotics and there is also some ML theory. There are a couple AI ethics people at Yale, and there soon will be a "digital ethics" person, but there aren't any AI safety people.

That said, there is a lot of latent support for AI safety at Yale. One of the global affairs professors involved in the Schmidt... (read more)

2scasper1y
This is great, thanks. It seems like someone wanting a large team of existing people with technical talent is a reason not to work somewhere like Yale. But what are the chances that the presence of lots of money and smart people would make this possible in the future? Is Yale working on strengthening its CS department? One of my ideas behind this post is that being the first person doing certain work in a department that has potential might have some advantages compared to being the 5th in a department that has already realized its potential. An AI safety professor at Yale might get invited to a lot of things, have little competition for advisees, be more uniquely known within Yale, and provide advocacy for AI safety in a way that counterfactually would not happen otherwise at the university.

I appreciate this. I don't even consider myself part of the rationality community, though I'm adjacent. My reasons for not drinking have nothing to do with the community and existed before I knew what it was. I actually get the sense this is the case for a number of people in the community (more of a correlation or common cause rather than caused by the community itself). But of course I can't speak for all.

I will be trying it on Sunday. We will see how it is.

I've thought about this comment, because it certainly is interesting. I think I was clearly confused in my questions to ChatGPT (though I will note: My tequila-drinking friends did not and still don't think tequila tastes at all sweet, including "in the flavor profile" or anything like that. But it seems many would say they're wrong!) ChatGPT was clearly confused in its response to me as well. 

I think this part of my post was incorrect:

It was perfectly clear: ChatGPT was telling me that tequila adds a sweetness to the drink. So it was telling me that

... (read more)
1green_leaf1y
You're right, ChatGPT did contradict itself and the chatbot it created based on the prompt (assuming it was all a part of a single conversation) tried to gaslight you.
-9Douglas_Knight1y

It should! I mentioned that probable future outcome in my original post.

I'm going to address your last paragraph first, because I think it's important for me to respond to, not just for you and me but for others who may be reading this.

When I originally wrote this post, it was because I had asked ChatGPT a genuine question about a drink I wanted to make. I don't drink alcohol, and I never have. I've found that even mentioning this fact sometimes produces responses like yours, and it's not uncommon for people to think I am mentioning it as some kind of performative virtue signal. People choose not to drink for all sorts of reas... (read more)

3Robert Kennedy1y
I am sorry for insulting you. My experience in the rationality community is that many people choose abstinence from alcohol, which I can respect, but I forgot that likely in many social circles that choice leads to feelings of alienation. While I thought you were signaling in-group allegiance, I can see that you might not have that connection. I will attempt to model better in the future, since this seems generalizable.   I'm still interested in whether the beet margarita with OJ was good~
8ChristianKl1y
OpenAI should likely explicitly train ChatGPT to be able to admit its errors.

Interesting! I hadn't come across that. Maybe ChatGPT is right that there is sweetness (perhaps to somebody with trained taste) that doesn't come from sugar. However, the blatant contradictions remain (ChatGPT certainly wasn't saying that at the beginning of the transcript).

OpenAI has in the past not been that transparent about these questions, but in this case, the blog post (linked in my post) makes it very clear it's trained with reinforcement learning from human feedback.

However, of course it was initially pretrained in an unsupervised fashion (it's based on GPT-3), so it seems hard to know whether this specific behavior was "due to the RL" or "a likely continuation".

This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even make choices of whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you're interested, here's a good talk.

2Charlie Steiner1y
Yes, I was deliberately phrasing things sort of like transformative experiences :P

That's fair. I think it's a critique of RLHF as it is currently done (just get lots of preferences over outputs and train your model). I don't think just asking you questions "when it's confused" is sufficient, it also has to know when to be confused. But RLHF is a pretty general framework, so you could theoretically expose a model to lots of black swan events (not just mildly OOD events) and make sure it reacts to them appropriately (or asks questions). But as far as I know, that's not research that's currently happening (though there might be something I'm not aware of).
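For readers unfamiliar with the mechanics being debated here: the "get lots of preferences over outputs" step is commonly implemented by fitting a reward model with a Bradley-Terry style loss on pairwise comparisons. A toy sketch of that loss (the linear "reward model" and the feature vectors are invented for illustration, not from any actual RLHF pipeline):

```python
import math


def reward(features, w):
    # Toy linear "reward model": score an output's feature vector.
    return sum(f * wi for f, wi in zip(features, w))


def preference_loss(chosen, rejected, w):
    """Bradley-Terry loss: training pushes reward(chosen) above reward(rejected)."""
    margin = reward(chosen, w) - reward(rejected, w)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# A labeler preferred output A over output B; gradient steps on this loss
# reshape the reward model, which in turn steers the policy via RL.
w = [0.5, -0.2]
print(round(preference_loss(chosen=[1.0, 0.0], rejected=[0.0, 1.0], w=w), 3))  # 0.403
```

The loss only encodes "the human liked A more than B," which is exactly why it rewards anything, deceptive or genuine, that makes outputs look preferable to the labeler.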

I think if they operationalized it like that, fine, but I would find the frame "solving the problem" to be a very weird way of referring to that. Usually, when I hear people saying "solving the problem" they have a vague sense of what they are meaning, and have implicitly abstracted away the fact that there are many continuous problems where progress needs to be made and that the problem can only really be reduced, but never solved, unless there is actually a mathematical proof.

I'm not a big supporter of RLHF myself, but my steelman is something like:

RLHF is a pretty general framework for conforming a system to optimize for something that can't clearly be defined. If we just did "if a human looked at this for a second, would they like it?", this does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. O... (read more)

I think I essentially agree with respect to your definition of "sanity," and that it should be a goal. For example, just getting people to think more about tail risk seems like your definition of "sanity" and my definition of "safety culture." I agree that saying that they support my efforts and say applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.

As for the last bit: trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I was totally unconcerned a... (read more)

I think RLHF doesn't make progress on outer alignment (like, it directly creates a reward for deceiving humans). I think the case for RLHF can't route through solving outer alignment in the limit, but has to route through somehow making AIs more useful as supervisors or as researchers for solving the rest of the AI Alignment problem.

Like, I don't see how RLHF "chips away at the proxy gaming problem". In some sense it makes the problem much harder since you are directly pushing on deception as the limit to the proxy gaming problem, which is the worst case for the problem.

I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to suddenly do big things in AI strategy, because downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.

I wouldn't be excited about (for example) retreats with undergrads to learn about "how you can help buy more time." I'm not even sure of the sign of int... (read more)

7JakubK1y
I worry that people will skip the post, read this comment, and misunderstand the post, so I want to point out how this comment might be misleading, even though it's a great comment. None of the interventions in the post are "go work at OpenAI to change things from the inside." And only the outreach ones sound anything like "going around and convincing others." And there's a disclaimer that these interventions have serious downside risks and require extremely competent execution. EDIT: one idea in the 2nd post is to join safety and governance teams at top labs like OpenAI. This seems reasonable to me? ("Go work on capabilities at OpenAI to change things" would sound unreasonable.)

I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely "concerned" does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):

Sorry, I don't want to say that safety culture is not relevant, but I want to say something like... "the real safety culture is just 'sanity'".

Like, I think you are largely playing a losing game if you try to get people to care about "safety" if the people making decisions are incapable of considering long-chained arguments, or are ... (read more)

I feel somewhat conflicted about this post. I think a lot of the points are essentially true.  For instance, I think it would be good if timelines could be longer all else equal. I also would love more coordination between AI labs. I also would like more people in AI labs to start paying attention to AI safety.

But I don't really like the bottom line. The point of all of the above is not reducible to just "getting more time for the problem to be solved." 

First of all, the framing of "solving the problem" is, in my view, misplaced. Unless you think... (read more)

2JakubK1y
I'm confused why this is an objection. I agree that the authors should be specific about what it means to "solve the problem," but all they need is a definition like "<10% chance of AI killing >1 billion people within 5 years of the development of AGI."

As far as I understand it, "intelligence" is the ability to achieve one's goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent - the thermometer example illustrates this. Therefore, a less goal-directed AI will always lose in competition against a more goal-directed one.

Your argument seems to be:

  1. Definitionally, intelligence is the ability to achieve one's goals.
  2. Less goal-directed systems are less intell
... (read more)
3Karl von Wendt1y
Thank you very much for your input! Admittedly, my reply to A was a bit short. I only wanted to point out that intelligence is closely linked to goal-directedness, not that they're the same thing (heat-seeking missiles are stupid, but very goal-directed entities, for example). A very intelligent system without a goal would just sit around, doing nothing. It might be able to potentially act intelligently, but without a goal it would behave like an unintelligent system. "Always" may be too strong a word, but if system X is more intelligent and wants to reach a conflicting goal much more than system Y, chances are that system X will get what it wants. I disagree. Being all-powerful does not mean always doing everything you want, or everything your partner wants. It means being able to do whatever you want, or maybe more importantly, whatever you feel you need to do. If, for example, I needed the magic wand to prevent the untimely death of someone I love, I would use it without a second thought. I tend to agree, but I guess there are many people who have been less lucky in their relationships than I have, being happily together with my wife for more than 44 years. :) Maybe not everyone and certainly not all the time, but I'm quite sure that most people would use it at least once in a while.

I think it's tangentially relevant in certain cases. Here's something I wrote in another context, where I think it's perhaps useful to understand what we really mean when we say "intelligence."

We consider humans intelligent not because they do better on all possible optimization problems (they don’t, due to the no free lunch theorem), but because they do better on a subset of problems that are actually encountered in the real world. For instance, humans have particular cognitive architectures that allow us to understand language well, and language is somet

... (read more)
2Morpheus1y
I think I disagree that generality is irrelevant (this is only the case because no-free-lunch theorems use unreasonable priors). If your problem has "any" structure, i.e. your environment is not maximally random, then you can use Occam's razor and make sense of your environment. No need for the "real world". The paper on universal intelligence is great, by the way, if formalizing intelligence seems interesting.

(Not reviewed by Dan Hendrycks.)

This post is about epistemics, not about safety techniques, which are covered in later parts of the sequence. Machine learning, specifically deep learning, is the dominant paradigm that people believe will lead to AGI. The researchers who are advancing the machine learning field have proven quite good at doing so, insofar as they have created rapid capabilities advancements. This post sought to give an overview of how they do this, which is in my view extremely useful information! We strongly do not favor advancements of cap... (read more)

3Morgan_Rogers1y
This is what I was trying to question with my comment above: Why do you think this? How am I to use this information? It's surely true that this is a community that needs to be convinced of the importance of work on safety, as you point out in the next post in the sequence, but how does information about, say, the turnover of ML PhD students help me do that?

There is conflation happening here which undermines your argument: theoretical approaches dominated how machine learning systems were shaped for decades, and you say so at the start of this post. It turned out that automated learning produced better results in terms of capabilities, and it is that success that makes it the continued default. But the former fact surely says a lot more about whether or not theory can "shape machine learning systems" than the latter. Following through with your argument, I would instead conclude that implementing theoretical approaches to safety might require us to compromise on capabilities, and this is indeed exactly what I expect: learning systems would have access to much more delicious data if they ignored privacy regulations and other similar ethical boundaries, but safety demands that capability is not the singular shaping consideration in AI systems.

This is simply not true. Failure modes which were identified by purely theoretical arguments have been realised in ML systems. System attacks and pathological behaviour (for image classifiers, say) are regularly built in theory before they ever meet real systems. It's also worth noting that any architecture choices made to, say, make backprop more algorithmically efficient, are driven by theory.

In the end, my attitude is not that "iterative engineering practices will never ensure safety", but rather that there are plenty of people already doing iterative engineering, and that while it's great to convince as many of those as possible to be safety-conscious, there would be further benefits to safety if some of their experience

From my comments on the MLSS project submission (which aren't intended to be comprehensive):

Quite enjoyed reading this, thanks for writing!

My guess is that the factors combine to create a roughly linear model. Even if progress is unpredictable and not linear, the average rate of progress will still be linear.

I’m very skeptical that this is a linear interpolation. It’s the core of your argument, but I didn’t think it was really argued. I would be very surprised if moving from 50% to 49% risk took similar time as moving from 2% to 1% risk, even if there are ... (read more)

1Stephen McAleese1y
I started with the assumption that alignment progress would have diminishing returns. Then the two other factors I took into account were the increasing relevance of alignment research over time[1] and an increasing number of alignment researchers. My model was that the diminishing returns would be canceled out by the increasing number of researchers and increasing relevance.

It seems like you're emphasizing the importance of diminishing returns. If diminishing returns are more important than the other two factors, progress would slow down over time. I'm not sure which factors are most influential, though I may have been underestimating the importance of diminishing returns.

Quote on how AI could reduce AI risk: I think you're referring to this quote. I think I could have explained this point more. I think existential risk levels would fall to very low levels after an aligned ASI is created, by definition:

  1. If the AI were aligned, then the AI itself would be a low source of existential risk.
  2. If it's also superintelligent, it should be powerful enough to strongly reduce all existential risks.

Those are some good points on cognitively enhanced humans. I don't think I emphasized the downsides enough. Maybe I need to expand that section.

[1] Toby Ord calls this decreasing nearsightedness.
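The three-factor reasoning above (diminishing returns vs. growing researcher count and growing relevance) can be made concrete with a toy simulation. Every number here is an illustrative assumption, not an estimate: yearly progress is the product of researcher count and relevance, damped by a diminishing-returns term that grows with cumulative progress.

```python
def simulate(years=20, researcher_growth=1.15, relevance_growth=1.05):
    """Toy model: yearly progress = researchers * relevance / (diminishing returns)."""
    researchers, relevance, total = 10.0, 1.0, 0.0
    totals = []
    for _ in range(years):
        # Each unit of past progress makes the next one harder (diminishing returns),
        # while the field grows and research becomes more relevant over time.
        rate = researchers * relevance / (10.0 + total)
        total += rate
        totals.append(total)
        researchers *= researcher_growth
        relevance *= relevance_growth
    return totals

totals = simulate()
```

Whether cumulative progress ends up roughly linear depends entirely on whether growth in researchers and relevance cancels the diminishing returns; changing the growth parameters bends the curve either way, which is exactly the crux being debated here.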

Thanks! I really appreciate it, and think it's a lot more accurate now. Nitpicks:

I think the MLSS link is currently broken. Also, in the headline table, it still emphasizes model robustness perhaps more than is warranted.

2Thomas Larsen1y
Right! I've changed both. 

As somebody who used to be an intern at CHAI, but certainly isn't speaking for the organization:

CHAI seems best approximated as a collection of researchers doing a bunch of different things. There is more reinforcement learning at CHAI than elsewhere, and it's ML research, but it's not top down at all so it doesn't feel that unified. Stuart Russell has an agenda, but his students have their own agendas which only sometimes overlap with his.

Also, as to your comment:

My worry is that academics will pursue strategies that work right now but won't work for AGI, because they are trying to win the competition instead of align AGIs. This might be really helpful though.

(My personal opinion, not necessarily the opinion of CAIS) I pretty much agree. It's the job of the concretizers (and also grantmakers to some extent) to incentivize/nudge research to be in a useful direction rather than a nonuseful direction, and for fieldbuilding to shift researchers towards more explicitly considering x-risk. But, ... (read more)

8Thomas Larsen1y
Yeah I think the difficulty of setting this up correctly is the main crux. I'm quite uncertain on this, but I'll give the argument my model of John Wentworth makes against this:

The Trojan detection competition does seem roughly similar to deception, and if you can find Trojans really well, it's plausible that you can find deceptive alignment. However, what we really need is a way to exert optimization pressure away from deceptive regions of parameter space. And right now, afaik, we have no idea how strongly deception is favored.

I can imagine using methods from this competition to put a small amount of pressure away from this, by, e.g., restarting whenever you see deception, or running SGD on your interpreted deception. But this feels sketchy because 1) you are putting pressure on these tools, and you might just steer into regions of space where they fail, and 2) you are training a model until it becomes deceptive: eventually, a smart deceptive model will be actively trying to beat these tools.

So what I really want is understanding the generators of deceptive alignment, which could take the form of a formal version of the argument given here, so that I can prevent entering the deceptive regions of parameter space in the first place.

Could you link an example? I am curious what you have in mind. I'm guessing something like the ROME paper?

Thanks so much for writing this! I think it's a very useful resource to have. I wanted to add a few thoughts on your description of CAIS, which might help make it more accurate.

[Note: I worked full time at CAIS from its inception until a couple weeks ago. I now work there on a part time basis while finishing university. This comment hasn't been reviewed by others at CAIS, but I'm pretty confident it's accurate.]

For somebody external to CAIS, I think you did a fairly good job describing the organization so thank you! I have a couple things I'd probably chan... (read more)

8Thomas Larsen1y
Thank you Thomas, I really appreciate you taking the time to write out your comment, it is very useful feedback.  I've linked your comment in the post and rewritten the description of CAIS. 
6ThomasW1y
Also, as to your comment:

(My personal opinion, not necessarily the opinion of CAIS) I pretty much agree. It's the job of the concretizers (and also grantmakers to some extent) to incentivize/nudge research to be in a useful direction rather than a nonuseful direction, and for fieldbuilding to shift researchers towards more explicitly considering x-risk. But, as you say, competition can be a valuable force; if you can set the incentives right, it might not be necessary for all researchers to be caring about x-risk. If you can give them a fun problem to solve and make sure it's actually relevant and they are only rewarded for actually relevant work, then good research could still be produced.

Relevant research has been produced by the ML community before by people who weren't explicitly thinking about x-risk (mostly "accidentally", i.e. not because anyone who cared about x-risk told them/incentivized them to, but hopefully this will change). Also, iterative progress involves making progress that works now but might not in the future. That's ok, as long as some of it does in fact work in the future.

I think the meaning of "distillation" is used differently by different people, and this confuses me. Originally (based on John Wentworth's post) I thought that "distillation" meant:

"Take existing ideas from a single existing work (or from a particular person) and present them in a way that is more understandable or more concise."

That's also the definition seemingly used by the Distillation Contest.

But then you give these examples in your curriculum:

Bushwackers:

  • Explain what a sharp left turn would look like in current ML paradigms
  • Explain the connection betwe
... (read more)
1Jonas Hallgren2y
First and foremost, my confidence in the descriptions of different distillation methods is pretty low. It is a framework I've thrown together from discussions on what an optimal science communication landscape would look like. It is in its initial phases and will most likely be imperfect for quite some time, as finding the optimal communication landscape is a difficult problem.

Secondly, great point! I think of it as a "reinterpretation of existing research." The basic way of doing this is rewriting a post for higher clarity, which is the classical way a distillation is viewed. I think there are more ways of doing this and that the space is underexplored.

In terms of the terminology proposed in the course, a "classic" distillation is some combination of what I would describe as propagating and bushwacking. Bushwacking would be more something like asking, "what the f*ck is going on here?", which might be relevant for things such as infra-bayesianism (I want to learn infra-bayesianism, can someone please bushwack this). Propagating would be more of what Rob Miles is doing.

So what is distillation? What is the superclass of all of these? I would phrase it like the following: "A distillation is a work that takes existing research and reinterprets it in a new light."

Finally, a meta point in defence of the introduction of new jargon. I think the term distillation is confusing in itself, as it can mean a lot of things, and therefore if you say "I'm bushwhacking this post" you get the idea that "ah, this person is cutting down the weeds of what is a confusing post". I hope to introduce new methodology so it is easier to understand what type of distillation someone is doing. (I don't think this terminology is optimal, but it's a start in the right direction IMO.)

Thanks for the suggestion, Richard! It actually probably fits best under one of the forthcoming lectures, but for now we added it to emergent behavior.

These are already the top ~10%, the vast majority of the submissions aren't included. We didn't feel we really had enough data to accurately rank within these top 80 or so, though some are certainly better than others. Also, it really depends on the point you're trying to make or the audience, I don't think there really exists an objective ordering.

We did do categorization at one point, but many points fall into multiple categories and there are a lot of individual points such that we didn't find it very useful when we had them categorized.

I'm not sure what you mean by "using bullet-pointed summaries of the 7 works stated in the post". If you mean the past examples of good materials, I'm not sure how good of an idea that is. We don't just want people to produce rephrasings/"distillations" of single pieces of prior work.

I'm also not sure we literally tell you how to win, but yes, reading the instructions would be useful.

4trevor2y
I meant reading them and making bullet-pointed lists of all valuable statements, in order to minimize the risk of forgetting something that could have been a valuable addition.

You make a very good point that there are pitfalls with this strategy, like having a summary of too many details when the important thing is a galaxy-brain framing that will demonstrate the problem to different types of influential people with the maximum success rate.

I think actually reading (and taking notes on) most/all of the 7 recommended papers that you guys listed is generally a winning strategy, both for winning the contest and for winning at solving alignment in time. But only for people who can do it without forgetting that they're making something optimal/inspirational for minimizing the absurdity heuristic, not fitting as many cohesive logic statements as they can onto a single sheet of paper. In my experience, constantly thinking about the reader (and even getting test-readers) is a pretty fail-safe way to get that right.

I made this link, which combines all arXiv listings from the last day in AI, ML, Computation and Language, Computer Vision, and Computers and Society into a single view. Since some papers are listed under multiple areas, I prefer this view so I don't skim over the same paper twice. If you bookmark it, it's just one click per day!
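For anyone building a similar combined view themselves, the core of it is just deduplicating cross-listed papers by arXiv id before displaying them. A minimal sketch (the ids and titles below are made up for illustration):

```python
def dedupe(listings):
    """Merge per-category paper lists, keeping the first occurrence of each arXiv id."""
    seen, merged = set(), []
    for papers in listings:
        for arxiv_id, title in papers:
            if arxiv_id not in seen:
                seen.add(arxiv_id)
                merged.append((arxiv_id, title))
    return merged

# A paper cross-listed in two categories shows up only once in the merged view.
cs_ai = [("2301.00001", "Paper A"), ("2301.00002", "Paper B")]
cs_lg = [("2301.00002", "Paper B"), ("2301.00003", "Paper C")]
merged = dedupe([cs_ai, cs_lg])  # Paper B appears once
```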

2DavidHolmes2y
If you get the daily arXiv email feeds for multiple areas it automatically removes duplicates (i.e. each paper appears exactly once, regardless of cross-listing). The email is not to everyone's taste of course, but this is a nice aspect of it.
1scasper2y
I'll use this from now on. 

No need to delete the tweet. I agree the examples are not info hazards; they're all publicly known. I just probably wouldn't want somebody going to good ML researchers who currently are doing something that isn't really capabilities (e.g., application of ML to some other area) and telling them "look at this, AGI soon."
