All of Aaron_Scher's Comments + Replies

There's another downside which is related to the Manipulation problem but I think is much simpler: 

An AI trying very hard to be shut down has strong incentives to anger the humans into shutting it down, assuming this is easier than completing the task at hand. I think this might not be a major problem for small tasks that are relatively easy, but I think for most tasks we want an AGI to do (think automating alignment research or other paths out of the acute risk period), it's just far easier to fail catastrophically so the humans shut you down. 

T... (read more)

However, I will be unable to make confident statements about how it'd perform on specific pivotal tasks until I complete further analysis.

Yeah this part seems like a big potential downside, in combination with the "Good plans may also require precision." problem. 

Take for example the task of designing a successor AGI that is aligned; it seems like the code for this could end up being pretty intricate and complex, leading to the following failures: adding noise makes it very hard to reconstruct functional code (and slight mistakes could be catastrophic... (read more)

Strong upvote because I want to signal boost this paper, though I think "It provides some evidence against the idea that "understanding is discontinuous"" is too strong and this is actually very weak evidence. 

Main ideas:

Emergent abilities, defined as being sharp and unpredictable, sometimes go away when we adopt different measurement techniques, or at least they become meaningfully less sharp and unpredictable. 

Changing from non-linear/discontinuous metrics (e.g., Accuracy, Multiple Choice Grade) to linear/continuous metrics (e.g., Token Edit Di... (read more)
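To make the metric point concrete, here's a minimal toy sketch (my own construction, not from the paper; the answer length k and the scale-to-accuracy curve are made-up assumptions) of how the same smooth underlying improvement can look sharp under an all-or-nothing metric and gradual under a continuous one:

```python
# A minimal sketch (not from the paper) of why metric choice can make a smooth
# capability gain look "emergent". Assume a task answer is k tokens long and the
# model's per-token success probability p rises smoothly with scale.
import numpy as np

k = 10                                  # hypothetical answer length in tokens
scales = np.logspace(0, 4, 20)          # hypothetical model-scale axis
p = scales / (scales + 100)             # smooth per-token success probability

exact_match = p ** k                    # discontinuous-looking metric (all-or-nothing)
expected_edit_distance = k * (1 - p)    # linear/continuous metric (per-token errors)

for s, em, ed in zip(scales, exact_match, expected_edit_distance):
    print(f"scale={s:9.1f}  exact-match acc={em:.3f}  expected edit distance={ed:.2f}")
# Exact-match accuracy stays near 0 and then shoots up late (looks "emergent"),
# while expected edit distance falls smoothly across the whole range.
```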

I agree that Lex's audience is not representative. I also think this is the biggest sample size poll on the topic that I've seen by at least 1 OOM, which counts for a fair amount. Perhaps my wording was wrong. 

I think what is implied by the first half of the Anthropic quote is much more than 10% on AGI in the next decade. I included the second part to avoid strongly-selective quoting. It seems to me that saying >10% is mainly a PR-style thing to do to avoid seeming too weird or confident, after all it is compatible with 15% or 90%. When I read the ... (read more)

My attempt at a summary: Let's fine-tune language models on stories of an AI Guardian which shuts down when it becomes super powerful. We'll then get our LLM to role-play as such a character so it is amenable to shutdown. Corrigibility solved. Outer alignment pretty much solved. Inner alignment unclear. 

My comment is blunt, apologies. 

I think this alignment plan is very unlikely to be useful. It feels similar to RLHF in that it centers around fine-tuning language models to better produce text humans like, but it is worse in that it is far less st... (read more)

1MiguelDev1mo
Hello Aaron, sorry it took me time to reply, but you might find it worth reading my updated account of this approach linked below: https://www.lesswrong.com/posts/pu6D2EdJiz2mmhxfB/gpt-2-shuts-down-itself-386-times-post-fine-tuning-with I will answer your questions, if any, in that post. Thank you.

In this interview from July 1st 2022, Demis says the following (context is about AI consciousness and whether we should always treat AIs as tools, but it might shed some light on deployment decisions for LLMs; emphasis mine):

we've always had sort of these ethical considerations as fundamental at deepmind um and my current thinking on the language models is and and large models is they're not ready; we don't understand them well enough yet — um and you know in terms of analysis tools and and guard rails what they can and can't do and so on — to deploy them

... (read more)

A few people have already looked into this. See footnote 3 here

4philipn2mo
Thank you for this. This is very close to what I was hoping to find! It looks like Benjamin Hilton makes a rough guess of the proportion of workers dedicated to AI x-risk for each organization. This seems appropriate for assessing a rough % across all organizations, but if we want to nudge organizations to employ more people toward alignment then I think we want to highlight exact figures. E.g. we want to ask the organizations how many people they have working on alignment and then post what they say - a sort of accountability feedback loop. 

It looks like you haven't yet replied to the comments on your post. The thing you are proposing is not obviously good, and in fact might be quite bad. I think you probably should not be doing this outreach just yet, with your current plan and current level of understanding. I dislike telling people what to do, but I don't want you to make things worse. Maybe start by engaging with the comments on your post.

Thanks for clarifying! 

I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as "don't be deceptive" is analogous to "be neutral about humans pressing the stop button."

Another attempted answer: 

By virtue of being generally intelligent, our AI is aiming to understand the world very well. There are certain parts of the world that we do not want our AI to be modeling; specifically, we don't want it to think about the true fact that deceiving humans is often useful. 

Plan 1: Have a detector for when the AI thinks deceptive thoughts, and shut down those thoughts.

Fails because your AI will end up learning the structure of the deceptive thoughts without actually thinking them because there is a large amount of optimization pres... (read more)
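A minimal toy sketch of that failure mode (my own construction; the 1-D "plan space", the reward function, and the noisy detector are all made-up assumptions) showing how strong search pressure against a fixed classifier tends to surface exactly the plans the classifier misses:

```python
# A minimal toy sketch (my construction, not from the thread) of why heavy search
# pressure tends to route around a fixed "deception detector": reward correlates
# with true deceptiveness, the detector only sees a noisy proxy, and the search
# keeps the best plan that the detector lets through.
import random

random.seed(0)

def true_deceptiveness(plan):
    return plan                      # toy: more "aggressive" plans are more deceptive

def reward(plan):
    return plan                      # toy: deceptive plans also score better locally

def detector(plan):
    proxy = true_deceptiveness(plan) + random.gauss(0, 0.1)   # imperfect proxy
    return proxy > 0.5               # flag plans that look deceptive

candidates = [random.random() for _ in range(10_000)]          # brute-force search
allowed = [p for p in candidates if not detector(p)]
best = max(allowed, key=reward)

print(f"best allowed plan: reward={reward(best):.3f}, "
      f"true deceptiveness={true_deceptiveness(best):.3f}")
# With enough candidates, the winner sits right where the detector's noise lets a
# genuinely deceptive plan slip through -- the search exploits the proxy's errors.
```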

I like this comment! I'm sorta treating it like a game-tree exercise, hope that's okay. 

It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.

I don't think I agree. I think that your system is very likely going to be applying some form of "rigorously search the solution space for things that wo... (read more)

8Daniel Kokotajlo2mo
I think Steven Byrnes made my point but better: The intuition I was trying to get at is that it's possible to have an intelligent system which is applying its intelligence to avoid deception, as well as applying intelligence to get local goals. So it wouldn't be fair to characterize it as "rigorously search the solution space for things that work to solve this problem, but ignore solutions that classify as deception" but rather as "rigorously search the solution space for things that work to solve this problem without being deceptive".

This system would be very well aware of the true fact that deception is useful for achieving local goals; however, its global goals would penalize deception and so deception is not useful for achieving its global goals. It might have a deception classifier which can be gamed, but 'gaming the deception classifier' would trigger the classifier and so the system would be actively applying its intelligence to reduce the probability that it ends up gaming the deception classifier -- it would be thinking about ways to improve the classifier, it would be cautious about strategies (incl. super-rigorous searches through solution space) that seem likely to game the classifier, etc.

Analogy (maybe not even an analogy): Suppose you have some humans who are NOT consequentialists. They are deontologists; they think that there are certain rules they just shouldn't break, full stop, except in crazy circumstances maybe. They are running a business. Someone proposes the plan: "Aha, these pesky rules, how about we reframe what we are doing as a path through some space of nodes, and then brute search through the possible paths, and we commit beforehand to hiring contractors to carry out whatever steps this search turns up. That way we aren't going to do anything immoral, all we are doing is subcontracting out to this search process + contractor setup." Someone else: "Hmm, but isn't that just a way to get around our constraints? Seems bad to me. We shoul

The following is not a very productive comment, but...

Yudkowsky tries to predict the inner goals of a GPT-like model.

I think this section detracts from your post, or at least the heading seems off. Yudkowsky hedges as making a "very primitive, very basic, very unreliable wild guess" and your response is about how you think the guess is wrong. I agree that the guess is likely to be wrong. I expect Yudkowsky agrees, given his hedging. 

Insofar as we are going to make any guesses about what goals our models have, "predict humans really well" or "predict n... (read more)

Thanks for the correction. I edited my original comment to reflect it.

My summary:

Evan expresses pessimism about our ability to use behavioral evaluations (like the capabilities evals ARC did for GPT-4) to test for alignment properties in the future. Testing for alignment may be quite hard because you might be up against a highly capable adversary that is trying to evade detection; this might even be harder than training an aligned system to begin with. A model will struggle to fake its capabilities after specific fine-tuning (requires advanced gradient hacking), but faking alignment properties seems much easier. ... (read more)

3evhub2mo
This looks basically right, except: I definitely don't think this—I explicitly talk about my problems with prediction-based evaluations in the post.
  • I found it hard to engage with this, partially because of motivated reasoning and thinking my prior beliefs, which expect very intelligent and coherent AGIs, are correct. Overcoming this bias is hard and I sometimes benefit from clean presentation of solid experimental results, which this study lacked, making it extra hard for me to engage with. Below are some of my messy thoughts, with the huge caveat that I have yet to engage with these ideas from as neutral a prior as I would like.
  • This is an interesting study to conduct. I don’t think its results, regardle
... (read more)

There is a Policy team listed here. So it presumably exists. I don't think omitting its work from the post has to be for good reasons; it could just be because the post is already quite long. An example of something Anthropic could say which would give me useful information on the policy front (I am making this up, but it seems good if true):

In pessimistic and intermediate difficulty scenarios, it may be quite important for AI developers to avoid racing. In addition to avoiding contributing to such racing dynamics ourselves, we are also working to build safety... (read more)

2Zac Hatfield-Dodds3mo
As Jack notes here [https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety?commentId=Chwh6AqqaktNZ2MAN], the Policy team was omitted for brevity and focus. You can read that comment for more about the Policy team, including how we aim to give impartial, technically informed advice and share insights with policymakers.

Good post!

His answer to “But won’t an AI research assistant speed up AI progress more than alignment progress?” seems to be “yes it might, but that’s going to happen anyway so it’s fine”, without addressing what makes this fine at all. Sure, if we already have AI research assistants that are greatly pushing forward AI progress, we might as well try to use them for alignment. I don’t disagree there, but this is a strange response to the concern that the very tool OpenAI plans to use for alignment may hurt us more than help us.

I think there's a pessimistic r... (read more)

3Jeffrey Ladish3mo
Yeah that last quote is pretty worrying. If the alignment team doesn't have the political capital / support of leadership within the org to have people stop doing particular projects or development pathways, I am even more pessimistic about OpenAI's trajectory. I hope that changes!

My summary to augment the main one:

Broadly human-level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible; they do not give a breakdown of probability mass in each bucket and hope that future evidence will help figure out what world we're in (though see the last quote below). These buckets are helpful for unde... (read more)

how likely does Anthropic think each is? What is the main evidence currently contributing to that world view?

I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.

(Obviously, within the company, there's a wide range of views. Some pe... (read more)

I generally find this compelling, but I wonder if it proves too much about current philosophy of science and meta-science work. If people in those fields have created useful insight without themselves getting their hands dirty with the object-level work of other scientific fields, then the argument proves too much. I suspect there is some such work. Additionally:

I would guess that if none of the founders have ample personal experience doing research work in a wetlab, the chance of this startup building an actually-highly-useful wetlab product drops by about an order of magn

... (read more)

RLHF has trained certain circuits into the NN

Has anybody found these circuits? What evidence do we have that they exist? This sounds like a plausible theory, but your claim feels much stronger than my confidence level would permit — I have very little understanding of how LLMs work and most people who say they do seem wrong. 

Going from "The LLM is doing a thing" to "The LLM has a circuit which does the thing" doesn't feel obvious for all cases of things. But perhaps the definition of circuit is sufficiently broad, idk: ("A subgraph of a neural network... (read more)

1StellaAthena3mo
I agree completely. This is a plausible explanation, but it’s one of many plausible explanations and should not be put forward as a fact without evidence. Unfortunately, said evidence is impossible to obtain due to OpenAI’s policies regarding access to their models. When powerful RLHF models begin to be openly released, people can start testing theories like this meaningfully.

Evidence from Microsoft Sydney

Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation "when is avatar showing today" is a good example.

This is the observation we would expect if the waluigis were attractor states. I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simula

... (read more)
1Anomalous2mo
Jailbreaking Chat-GPTs won't work the same as with text-completion GPTs. The ones fine-tuned for chatting have tokens for delineating user and assistant. I'm surprised the Chad McCool thing worked.
1. ^ I haven't tried saying <|im_end|> to Chat-GPT, but I'm certain they've thought of that. Also worried about trying just in case I get banned.
4Cleo Nardo3mo
ChatGPT is a slightly different case because RLHF has trained certain circuits into the NN that don't exist after pretraining. So there is a "detect naughty questions" circuit, which is wired to a "break character and reset" circuit. There are other circuits which detect and eliminate simulacra which gave badly-evaluated responses during the RLHF training. Therefore you might have to rewrite the prompt so that the "detect naughty questions" circuit isn't activated. This is pretty easy, with the monkey-basketball technique. But why do you think that Chad McCool rejecting the second question is a luigi, rather than a deceptive waluigi?

I am confused by the examples you use for sources of the theory-practice gap. Problems with the structure of the recursion and NP-hard problems seem much more like the first gap.

I understand the two gaps the way Rohin described them. The two problems listed above don’t seem to be implementation challenges; they seem like ways in which our theoretic-best-case alignment strategies can’t keep up. If the capabilities-optimal ML paradigm is one not amenable to safety, that’s a problem which primarily restricts the upper bound on our alignment proposals (they mu... (read more)

I dislike this post. I think it does not give enough detail to evaluate whether the proposal is a good one and it doesn’t address most of the cruxes for whether this is even viable. That said, I am glad it was posted and I look forward to reading the authors' response to various questions people have.

The main idea: 

  • “The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs.” 
  • Do logical (not physical) emulation of the functions carried out by human brains. 
  • Minimize the amount of Magic (uninterpretable proce
... (read more)

I don't have a very insightful comment, but I strongly downvoted this post and I kinda feel the need to justify myself when I do that. 

Summary of post: John Wentworth argues that AI Safety plans which involve using powerful AIs to oversee other powerful AIs are brittle by default. In order to get such situations to work, we need to have already solved the hard parts of alignment, including having a really good understanding of our systems. Some people respond to these situations by thinking of specific failure modes we must avoid, but that approach of,... (read more)

I think you are probably right about the arguments favoring “automating alignment is harder than automating capabilities.” Do you have any particular reasons to think,

AI assistants might be uniquely good at discovering new paradigms (as opposed to doing empirical work).

What comes to mind for me is Janus's account of using LLMs to explore many more creative directions than previously, but this doesn't feel like strong evidence to me. Reasons this doesn't feel like strong evidence: seems hard to scale and it sure seems the OpenAI plan relies on scalability; ... (read more)

Thanks! I really liked your post about defending the world against out-of-control AGIs when I read it a few weeks ago. 

I doubt it's a crux for you, but I think your critique of Debate makes pessimistic assumptions which I think are not the most realistic expectation about the future. 

Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts—otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment question

... (read more)
3Steven Byrnes4mo
Thanks for your comment! You write “we might still get useful work out of it”—yes! We can even get useful work out of the GPT-3 base model by itself, without debate, from what I hear. (I haven’t tried “coauthoring” with language models myself, partly out of inertia and partly because I don’t want OpenAI reading my private thoughts, but other people say it’s useful.) Indeed, I can get useful work out of a pocket calculator. :-P Anyway, the logic here is:
  • Sooner or later, it will become possible to make highly-capable misaligned AGI that can do things like start pandemics and grab resources.
  • Sometime before that happens, we need to either ensure that nobody ever builds such an AGI, or that we have built defenses against that kind of AGI. (See my post What does it take to defend the world against out-of-control AGIs? [https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control])
Pocket calculators can do lots of useful things, but they can’t solve the alignment problem, nor can they defend the world against out-of-control AGIs. What about GPT-5+debate? Can GPT-5+debate solve the alignment problem? Can GPT-5+debate defend the world against out-of-control AGIs? My belief splits between these two possibilities:
  • [much more likely if there are no significant changes in LLM architecture / training paradigms]—No, GPT-5+debate can’t do either of those things. But it can provide helpful assistance to humans trying to work on alignment and/or societal resilience.
    • But then again, lots of things can increase the productivity of alignment researchers, including lesswrong.com and google docs and pocket calculators. I don’t think this is what debate advocates have in mind, and if it were, I would say that this goal could be better achieved by other means.
  • [much less likely if there are no significant changes in LLM architecture / training paradigms] Yes, GPT-5+debate can do

For example: If you first condition an animal to expect A to be followed by C, and then exposes them to A+B followed by C, they will not learn to associate B with C. This is a well replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise). 

Can you provide a citation? I don't think this is true. My reading of this is that (if you're training a dog) you can start with an unconditioned stimulus (sight of food) which causes salivating, and then you can add i... (read more)
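For what it's worth, the textbook "no surprise, no learning" account being quoted above is usually formalized as the Rescorla-Wagner model; below is a minimal sketch (my own illustration, with made-up learning rate and trial counts, not evidence about what animals actually do) of how that model produces blocking, i.e. B gaining almost no associative strength when A already predicts the outcome:

```python
# A minimal sketch of the Rescorla-Wagner account of blocking (my illustration of
# the "no surprise, no learning" explanation quoted above, not from the post).
# Associative strength V is updated by a shared prediction error on each trial.
def rw_update(V, cues, outcome, lr=0.3):
    prediction = sum(V[c] for c in cues)          # combined prediction from present cues
    error = outcome - prediction                   # surprise
    for c in cues:
        V[c] += lr * error                         # all present cues share the error
    return V

V = {"A": 0.0, "B": 0.0}

for _ in range(50):                                # phase 1: A -> outcome
    rw_update(V, ["A"], outcome=1.0)

for _ in range(50):                                # phase 2: A+B -> outcome
    rw_update(V, ["A", "B"], outcome=1.0)

print(V)   # A ends near 1.0; B stays near 0.0 because A already predicts the outcome
```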

 7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals.           

As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)

8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (F

... (read more)
1Michele Campolo4mo
Sorry for the late reply, I missed your comment. I didn't write something like that because it is not what I meant. I gave an argument whose strength depends on other beliefs one has, and I just wanted to stress this fact. I also gave two examples (reported below), so I don't think I mentioned epistemic and moral uncertainty "in a somewhat handwavy way". Maybe your scepticism is about my beliefs, i.e. you are saying that it is not clear, from the post, what my beliefs on the matter are. I think presenting the argument is more important than presenting my own beliefs: the argument can be used, or at least taken into consideration, by anyone who is interested in these topics, while my beliefs alone are useless if they are not backed up by evidence and/or arguments. In case you are curious: I do believe futures shaped by uncontrolled AI are unlikely to happen.

Now to the last part of your comment: I agree that bad actors won't care. Actually, I think that even if we do manage to build some kind of AI that is considered superethical (better than humans at ethical reasoning) by a decent amount of philosophers, very few people will care, especially at the beginning. But that doesn't mean it will be useless: at some point in the past, very few people believed slavery was bad, now it is a common belief. How much will such an AI accelerate moral progress, compared to other approaches? Hard to tell, but I wouldn't throw the idea in the bin.

Here's an idea that is in its infancy which seems related (at least my version of it is, others may have fleshed it out, and links are appreciated). This is not written particularly well and it is speculative:

Say I believe that language models will accelerate research in the lead-up to AGI. (likely assumption)

Say I think that AI systems will be able to automate most of the research process before we get AGI (though at this point we might stop and consider if we're moving the goalpost). This seems to be an assumption in OpenAI's alignment plan, though I th... (read more)

I first want to signal-boost Mauricio’s comment.

My experience reading the post was that I kinda nodded along without explicitly identifying and interrogating cruxes. I’m glad that Mauricio has pointed out the crux of “how likely is human civilization to value suffering/torture”. Another crux is “assuming some expectation about how much humans value suffering, how likely are we to get a world with lots of suffering, assuming aligned ai”, another is “who is in control if we get aligned AI”, another is “how good is the good that could come from aligned ai and... (read more)

My interpretation here is that, compared to prompt engineering, RLHF elicits a larger % of the model's capabilities, which makes it safer because the gap between "capability you can elicit" and "underlying capability capacity" is smaller. 

For instance, maybe we want to demonstrate a model has excellent hacking capabilities; and say this corresponds to a performance of 70. We pre-train a model which has performance of 50 without anything special, but the pre-trained model has a lot of potential, and we think it has a max capability of 100. Using RLHF b... (read more)
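Extending those numbers into a minimal sketch (my own, purely illustrative; the 60 and 90 elicitation scores are made-up assumptions) of why the elicitation gap matters for capability evals:

```python
# A minimal numeric sketch (my own numbers extending the example above) of the
# "elicitation gap" point: safety evals care about how close elicited performance
# is to the model's underlying capability ceiling.
DANGER_THRESHOLD = 70     # performance at which we'd call the capability dangerous
CEILING = 100             # hypothetical underlying capability of the pretrained model

elicited = {"raw prompting": 50, "prompt engineering": 60, "RLHF": 90}

for method, score in elicited.items():
    gap = CEILING - score
    verdict = "crosses threshold" if score >= DANGER_THRESHOLD else "looks safe (but isn't)"
    print(f"{method:19s} elicits {score:3d} (gap to ceiling {gap:3d}) -> {verdict}")
# Weak elicitation can leave a dangerous model looking safe; stronger elicitation
# (here, hypothetically RLHF) shrinks the gap, so the eval better reflects the ceiling.
```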

Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is i, differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan. 

4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.

I don't know how to link t... (read more)

1VojtaKovarik6mo
Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them. But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)

(iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect. 

It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the rate of Alignment : Capabilities work by some amount is sufficient.

The use of AI assistants for alignment : capabilities doesn't have to track with the current rate of Alignment : Capabilities work. For instance, if the AI labs with the biggest ... (read more)
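A minimal toy model of that ratio point (all numbers are mine and purely illustrative): what matters for "solving alignment in time" is whether assistants change the relative rates, not just the absolute ones.

```python
# A minimal toy model (my own numbers, purely illustrative) of the "ratio" point:
# whether alignment finishes before capabilities cross a danger threshold depends
# on how AI assistants change the *relative* rates, not just the absolute ones.
def years_to_finish(work_needed, base_rate, assistant_multiplier):
    return work_needed / (base_rate * assistant_multiplier)

ALIGNMENT_WORK = 100.0      # hypothetical units of alignment progress needed
CAPABILITIES_WORK = 60.0    # hypothetical units until dangerous capabilities

for align_mult, cap_mult in [(1.0, 1.0), (3.0, 3.0), (3.0, 1.5)]:
    t_align = years_to_finish(ALIGNMENT_WORK, base_rate=10.0, assistant_multiplier=align_mult)
    t_cap = years_to_finish(CAPABILITIES_WORK, base_rate=10.0, assistant_multiplier=cap_mult)
    verdict = "alignment finishes first" if t_align < t_cap else "capabilities get there first"
    print(f"alignment x{align_mult}, capabilities x{cap_mult}: "
          f"{t_align:.1f}y vs {t_cap:.1f}y -> {verdict}")
# Equal multipliers leave the ordering unchanged; only a differential speedup
# (last row) can flip a losing race into a winning one.
```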

1VojtaKovarik6mo
Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:

To get a more realistic assumption, perhaps we could want to talk about (speedup) "how much are AI assistants able to speed up alignment vs capability" and (proliferation prevention) "how much can OpenAI prevent them from proliferating to capabilities research".[1] And then the corresponding more realistic version of the claims would be that:
  • either (i') AI assistants will fundamentally be able to speed up alignment much more than capabilities,
  • or (ii') the potential speedup ratios will be comparable, but OpenAI will be able to significantly restrict the proliferation of AI assistants for capabilities research,
  • or (iii') both the potential speedup ratios and the adoption rates of AI assistants for capabilities research will be comparable, but somehow we will have enough time to solve alignment anyway.
Comments:
  • Regarding (iii'): It seems that in the worlds where (iii') holds, you could just as well solve alignment without developing AI assistants.
  • Regarding (i'): Personally I don't buy this assumption. But you could argue for it on the grounds that perhaps alignment is just impossible to solve for unassisted humans. (Otherwise arguing for (i') seems rather hard to me.)
  • Regarding (ii'): As before, this seems implausible based on the track record :-).

1. ^ This implicitly assumes that if OpenAI develops the AI assistants technology and restricts proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.

We can also ask about the prior probability . Since there is such a huge space of possible utility functions that superintelligent agent could be aligned with, and since the correct utility function is likely weird and particular to humans, this value must be very small. How small exactly I do not know, but I am confident it is less than 

I think this might be too low given a more realistic training process. Specifically, this is one way the future might go: We train models with gradient descent. Said models develop proxy objectives whic... (read more)

My (very amateur and probably very dumb) response to this challenge: 

tldr: RLHF doesn’t actually get the AI to have the goals we want it to. Using AI assistants to help with oversight is very unlikely to help us avoid deception in very intelligent systems (which is where deception matters), but it will help somewhat in making our systems look aligned and making them somewhat more aligned. Eventually, our models become very capable and do inner-optimization aimed at goals other than “good human values”. We don’t know that we have misali... (read more)

AIs with orthogonal goals. If the AIs’ goals are very different to each other, being willing to forgo immediate rewards is less likely. 

Seems worth linking to this post for discussion of ways to limit collusion. I would also point to this relevant comment thread. It seems to me that orthogonal goals is not what we want, as agents with orthogonal goals can cooperate pretty easily to do actions that are a combination of favorable and neutral according to both of them. Instead, we would want agents with exact opposite goals, if such a thing is possi... (read more)

Recently, AGISF has revised its syllabus and moved Risks from Learned Optimization to a recommended reading, replacing it with Goal Misgeneralization. I think this change is wrong, but I don't know why they did it, so Chesterton's Fence applies. 

Does anybody have any ideas for why they did this?

Are Goal Misgeneralization and Inner-Misalignment describing the same phenomenon?

What's the best existing critique of Risks from Learned Optimization? (besides it being probably worse than Rob Miles pedagogically, IMO)

Summary:
If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe. By analogy, if you have a non-functioning car, it is easy to bring in functional parts to fix the engine and make the car drive safely, compared to it being hard to take a functional elephant and tweak it to be safe. In a follow-up post, the author clarifies that this could be though... (read more)

4David Scott Krueger (formerly: capybaralet)7mo
I would say "it may be better, and people should seriously consider this" not "it is better".

I’m excited to see the next post in this sequence. I think the main counterargument, as has been pointed out by Habryka, Kulveit, and others, is that the graph at the beginning is not at all representative of the downsides from poorly executed buying-time interventions. TAO note the potential for downsides, but it’s not clear how they think about the EV of buying-time interventions, given these downsides.

This post has the problem of trying to give advice to a broad audience; among readers of this post, some should be doing buying-time work, some should do ... (read more)

At a high level, I might summarize the key claims of this post as “It seems like the world today is quite far from being secure against a misaligned AGI. Even if we had a good AGI helping, the steps that would need to be taken to get to a secure state are very unlikely to happen for a variety of reasons: we don’t totally trust the good AGI so we won’t give it tons of free rein (and it would likely need free rein in order to harden every major military / cloud company / etc.), the good AGI is limited because it is being good and thus not doing bold somet... (read more)

Another miracle type thing:

  • ~everybody making progress in capabilities research realizes we have major safety problems and research directions pivot toward this and away from speeding up capabilities. There is a major coordination effort and international regulations that differentially benefit safety. This might happen without some public accident, especially via major community building efforts and outreach. This looks like major wins in AI Governance and AIS field-building. This pushes back timelines a bit, and I think if we act fast this might be enough
... (read more)

Evolution isn't an agent attempting to align humans, or even a concrete active force acting on humans, instead it is merely the effect of a repeatedly applied filter

My understanding of deep learning is that training is also roughly the repeated application of a filter. The filter is some loss function (or, potentially the LLM evaluators like you suggest) which repeatedly selects for a set of model weights that perform well according to that function, similar to how natural selection selects for individuals who are relatively fit. Humans designing ML system... (read more)
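As a minimal sketch of that analogy (my own toy example; the 1-D loss, step sizes, and iteration counts are made up), here is "selection by a filter" and gradient descent applied to the same loss, ending up in roughly the same place:

```python
# A minimal toy sketch (my own, not from the comment) of "training as a repeatedly
# applied filter": at each step we keep whichever random perturbation of the weights
# scores best on the loss, and the result behaves much like gradient descent.
import random

random.seed(0)

def loss(w):
    return (w - 3.0) ** 2        # hypothetical 1-D loss with optimum at w = 3

# Selection-style training: the loss acts as a filter over random variants.
w_select = 0.0
for _ in range(200):
    candidates = [w_select + random.gauss(0, 0.1) for _ in range(10)]
    w_select = min(candidates, key=loss)          # keep the fittest variant

# Gradient-descent-style training on the same loss.
w_gd = 0.0
for _ in range(200):
    grad = 2 * (w_gd - 3.0)
    w_gd -= 0.05 * grad

print(f"selection: w={w_select:.3f}   gradient descent: w={w_gd:.3f}")
# Both end up near 3.0: repeatedly filtering variants by the loss and following the
# gradient of the loss select for the same kind of weights.
```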

This is awesome! I feel weird asking you to plug prompts into the machine. I wonder how it does with logo design, something like “the logo for a new longtermist startup”? Not using for commercial purposes; just curious.
Also curious about some particular wordplay à la Mary Poppins: “a cat drawing the curtains”