All of Thomas Kwa's Comments + Replies

It's not just his fiction. Recently he went on what he thought was a low-stakes crypto podcast and was surprised that the hosts wanted to actually hear him out when he said we were all going to die soon:

I don't think we can take this as evidence that Yudkowsky or the average rationalist "underestimates more average people". In the Bankless podcast, Eliezer was not trying to explore the beliefs of the podcast hosts, just explaining his own views. And there have been attempts at outreach before. If Bankless was evidence towards "t... (read more)

2cata1h
This makes little sense to me, since "what should I do" isn't a function of p(doom). It's a function of both p(doom) and your inclinations, opportunities, and comparative advantages. There should be many people for whom, rationally speaking, a difference between 35% and 34% should change their ideal behavior.

Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:

https://manifold.markets/ThomasKwa/will-someone-strengthen-our-goodhar?r=VGhvbWFzS3dh

Downvoted; this is very far from a well-structured argument, and it doesn't give me intuitions I can trust either.

2Raemon10d
I didn't downvote but didn't upvote and generally wish I had an actual argument to link to when discussing this concept.

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.

edit: The proof is easy. Let f and g be two such indistinguishable functions that you place positive probability on, F be a random variable for the function, and F' be F but with all probability mass for f replaced by g. Then .... (read more)
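For readers who hit the truncation, here is a minimal sketch of how the argument plausibly goes, written with generic names f and g for the two indistinguishable functions (my reconstruction, not the original text):

```latex
% Sketch (my reconstruction of the truncated step, not the author's text).
% Suppose f and g agree on every observation of at most s bits, and F' is F
% with all of the probability mass on f moved to g. For any observation o of
% at most s bits:
\[
P_{F'}(o) = \sum_{h} P_{F'}(h)\,P(o \mid h)
          = \sum_{h \neq f,\, g} P_F(h)\,P(o \mid h) + \bigl(P_F(f) + P_F(g)\bigr)\,P(o \mid g)
          = P_F(o),
\]
% using P(o | f) = P(o | g). So F and F' make identical predictions on any s
% observed bits, and zeroing out f costs nothing predictively.
```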

  • Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending against self-deception favors weight-based methods, but these seem uncommon.
  • I thought briefly about the Ilharco et al paper and am very impressed by it as well.
  • Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

4TurnTrout13d
Weight vectors are derived through fine-tuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (eg via RLHF) when writing why you were excited about activation additions, I don't see how this paper changes the balance very much? (I wrote my thoughts here in Activation additions have advantages over (RL/supervised) finetuning [https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector#Activation_additions_have_advantages_over__RL_supervised__finetuning])

I think the main additional piece of information given by the paper is the composability of finetuned edits unlocking a range of finetuning configurations, which grows exponentially with the number of composable edits. But I personally noted that finetuning enjoys this benefit in the original version of the post.

There's another strength which I hadn't mentioned in my writing, which is that if you can finetune into the opposite direction of the intended behavior (like you can make a model less honest somehow), and then subtract that task vector, you can maybe increase honesty, even if you couldn't just naively finetune that honesty into the model.[1]

But, in a sense, task vectors are "still in the same modalities we're used to." Activation additions jolted me because they're just... a new way[2] of interacting with models! There's been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.

1. ^ This is a kinda sloppy example because "honesty" probably isn't a primitive property of the network's reasoning. Sorry.

2. ^ To be very clear about the novelty of our contributions, I'll quote the "Summary of relationship to prior work" section: But this "activation engineer
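For readers unfamiliar with the task-vector arithmetic discussed above (Ilharco et al., https://arxiv.org/abs/2212.04089), here is a minimal, hypothetical sketch of the idea; the function names and the honesty example are illustrative assumptions, not code from the paper:

```python
# Hypothetical sketch of task-vector arithmetic (in the style of Ilharco et al. 2022).
# Assumes two checkpoints of the same architecture: a base model and a model
# finetuned toward some behavior (here, illustratively, "dishonesty").
import torch

def task_vector(base_state: dict, finetuned_state: dict) -> dict:
    """Elementwise weight difference: finetuned minus base."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def apply_task_vector(base_state: dict, vector: dict, scale: float = 1.0) -> dict:
    """Add a scaled task vector to the base weights (negative scale subtracts it)."""
    return {k: base_state[k] + scale * vector[k] for k in base_state}

# Usage (hypothetical): subtract a "dishonesty" task vector to try to increase honesty.
# base = base_model.state_dict()
# vec = task_vector(base, dishonest_model.state_dict())
# base_model.load_state_dict(apply_task_vector(base, vec, scale=-1.0))
```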

I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far O... (read more)

3Dan H14d
(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper) [https://arxiv.org/pdf/2303.16200.pdf#page=28]. If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don't see this as a strong reason to dismiss https://arxiv.org/abs/2212.04089 [https://arxiv.org/abs/2212.04089]. I would want a weight version of a method and an activation version of a method; they tend to have different strengths.

Note: If you're wanting to keep track of safety papers outside of LW/AF, papers including https://arxiv.org/abs/2212.04089 [https://arxiv.org/abs/2212.04089] were tweeted on https://twitter.com/topofmlsafety [https://twitter.com/topofmlsafety] and posted on https://www.reddit.com/r/mlsafety [https://www.reddit.com/r/mlsafety]

Edit: I see passive disagreement but no refutation. The argument against weights was of the form "here's a strength activations have"; for it to be enough to dismiss the paper without discussion, that must be an extremely strong property that outweighs all of the paper's potential merits, or a Pareto improvement. Neither seems corroborated or at all obvious.
Thomas Kwa15dΩ1538-13

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions.

What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training. 

Using chatbots and feeling ok about it seems like a no-brainer. It's technology that provides me a multiple percentage point productivity boost, it's used by over a billion people, and a boycott of chatbots is well outside the optimal or feasible space of actions to help the world.

I think the restaurant analogy fails because ChatGPT was not developed out of malice, just recklessness. For the open source models, there's not even an element of greed.

5Adam Zerner18d
My impression is that if the restaurant owner killed the family out of recklessness instead of malice, most people would still feel a very strong sense of disdain and choose to avoid the restaurant.

It doesn't look circular to me? I'm not assuming that we get Goodhart, just that properties that result in very high X seem like they would be things like "very rhetorically persuasive" or "tricks the human into typing a very large number into the rating box" that won't affect V much, rather than properties with very high magnitude towards both X and V. I believe this less for V, so we'll probably have to replace independence with this.

I think you're splitting hairs. We prove Goodhart follows from certain assumptions, and I've given some justification for ... (read more)

In my frame, U is not just some variable correlated with V, it's some estimator's best estimate of V, and so it makes sense that the residuals X would have various properties, for the same reason we consider residuals in statistics, returns in finance, etc.

The basic idea why we might get X independent of V is that there are some properties that increase the overseer's rating and actually make the plan good (say, the plan includes a solution to the shutdown problem, interpretability, or whatever) and different properties that increase the o... (read more)

1rotatingpaguro18d
Are you saying that your (rough, preliminary) justification for independence is that it's what gets you Goodhart, so you use it? Isn't this circular? Ok so maybe I misinterpreted your intentions: I thought you wanted to "prove" that Goodhart happens, while possibly you wanted to "show an example" of Goodhart happening?

I think this is more like Extremal Goodhart in Garrabrant's taxonomy: there's a distributional shift inherent to high U.

1David Johnston18d
Maybe it’s similar, but high U is not necessary

SGD has inductive biases, but we'd have to actually engineer them to get high V rather than high X when only trained on U. In the Gao et al paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I'm pretty excited about work that improves on that paper to get higher gold reward while only having access to the proxy reward model.
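To make the conditioning-vs-RL comparison concrete, here is a rough best-of-n ("conditioning") sketch of optimizing against a proxy reward model while scoring on a gold reward, in the spirit of the Gao et al. setup; `generate`, `proxy_reward`, and `gold_reward` are hypothetical stand-ins, not the paper's code:

```python
# Rough sketch of best-of-n selection against a proxy reward model, evaluated on a
# gold reward (in the spirit of Gao et al.'s overoptimization setup).
# `generate`, `proxy_reward`, and `gold_reward` are hypothetical stand-ins.
import math

def best_of_n(prompt, n, generate, proxy_reward):
    samples = [generate(prompt) for _ in range(n)]
    return max(samples, key=proxy_reward)          # optimize only the proxy

def evaluate(prompts, n, generate, proxy_reward, gold_reward):
    picks = [best_of_n(p, n, generate, proxy_reward) for p in prompts]
    kl = math.log(n) - (n - 1) / n                 # standard KL(best-of-n || base) formula
    return kl, sum(gold_reward(x) for x in picks) / len(picks)

# As n grows, KL from the base policy grows, and gold reward first rises then
# (with an imperfect proxy) flattens or falls: the overoptimization curve.
```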

I think the point still holds in mainline shard theory world, which in m... (read more)

That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.

2habryka19d
Yeah, does sure seem like we should update something here. I am planning to spend more time on AIAF stuff soon, but until then, if someone has a drop-in paragraph, I would probably lightly edit it and then just use whatever you send me/post here.

This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work.

Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.

I'd be much happier with increasing participants enough to equal 10-20% of the field of ML than a 6 month unconditional pause, and my guess is it's less costly. It seems like leading labs allowing other labs to catch up by 6 months will reduce their valuations more than 20%, whereas diverting 10-20% of their resources would reduce valuations only 10% or so.

There are currently 300 alignment researchers. If we take additional researchers from the pool of 30k people who attended ICML, you get 3000 researchers, and if they're equal quality this is 10x particip... (read more)

If I've already done WMLB, what day should I start on? The WMLB curriculum on mechinterp wasn't very polished, and IOI and superposition were not covered. But doing part of the transformers week would mean getting material I've already learned on RL.

1TheMcDouglas1mo
The first week of WMLB / MLAB maps quite closely onto the first week of ARENA, with a few exceptions (ARENA includes PyTorch Lightning, plus some more meta stuff like typechecking, VSCode testing and debugging, using GPT in your workflow, etc). I'd say that starting some way through the second week would probably be most appropriate. If you didn't want to repeat stuff on training / sampling from transformers, the mech interp material would start on Wednesday of the second week.

Fair point. Another difference is that the pause is popular! 66-69% in favor of the pause, and 41% think AI would do more harm than good vs 9% for more good than harm.

I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:

  • It's incredibly difficult and incentive-incompatible with existing groups in power
  • There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
  • There are some obvious negative effects; potential overhangs or greater inc
... (read more)
2TurnTrout1mo
Why does this have to be true? Can't governments just compensate existing AGI labs for the expected commercial value of their foregone future advances due to indefinite pause? 
6quetzal_rainbow1mo
This statement begs for cost-benefit analysis. Increasing the size of the alignment field can be efficient, but it won't be cheap. You need to teach new experts in a field that doesn't have any polished, standardized educational programs and doesn't have many teachers. If you want not only to increase the number of participants in the field but to increase the productivity of the field 10x, you need an extraordinary educational effort. Passing regulation to require evals seems like a meh idea. Nobody knows in enough detail how to make such evaluations, and every wrong idea that makes its way into law will be there until the end of the world.

The obvious dis-analogy is that if the police had no funding and largely ceased to exist, a string of horrendous things would quickly occur. Murders and thefts and kidnappings and rapes and more would occur throughout every country in which it was occurring, people would revert to tight-knit groups who had weapons to defend themselves, a lot of basic infrastructure would probably break down (e.g. would Amazon be able to pivot to get their drivers armed guards?) and much more chaos would ensue.

And if AI research paused, society would continue to basically function as it has been doing so far.

One of them seems to me like a goal that directly causes catastrophes and a breakdown of society and the other doesn't.

What information? What spectrum? The color information received by the webcam is the total intensity of light when passed through a red filter, the total intensity when passed through a blue filter, and the total intensity when passed through a green filter, at each point. You do not know the frequency of these filters (or that frequency of light is even a thing). I'm sure you could deduce something by playing around with relative intensities and chromatic aberration, but ultimately you cannot build a spectrum with three points. 
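A tiny numeric illustration of the "three points can't pin down a spectrum" claim: two different spectra can produce identical readings through three filters (a metamer). The filter curves below are made-up Gaussians I'm assuming for illustration, not real webcam data:

```python
import numpy as np

# Hypothetical filter sensitivity curves: three Gaussian bumps standing in for
# the red, green, and blue filter responses of a webcam.
wavelengths = np.linspace(400, 700, 31)                  # nm
def bump(center, width=40):
    return np.exp(-((wavelengths - center) / width) ** 2)
filters = np.stack([bump(600), bump(540), bump(460)])    # shape (3, 31)

# An arbitrary "true" spectrum.
s1 = 1.0 + 0.5 * np.sin(wavelengths / 50.0)

# Any vector in the null space of the 3x31 filter matrix can be added to the
# spectrum without changing the three recorded intensities (a metamer).
_, _, vt = np.linalg.svd(filters)
null_vec = vt[-1]                                        # orthogonal to all three filter rows
s2 = s1 + 0.5 * null_vec / np.abs(null_vec).max()        # a genuinely different spectrum

print(np.allclose(filters @ s1, filters @ s2))           # True: identical RGB readings
print(np.allclose(s1, s2))                               # False: different spectra
```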

I don't think we disag... (read more)

4titotal1mo
Indeed! Deriving physics requires a number of different experiments specialized to the discovery of each component. I could see how a spectrograph plus an analysis of the bending of light could get you a guess that light is quantised via the ultraviolet catastrophe [https://en.wikipedia.org/wiki/Ultraviolet_catastrophe], although I'm doubtful this is the only way to get the equation describing the black body curve. I think you'd need more information, like the energy transitions of atoms or Maxwell's equations, to get all the way to quantum mechanics proper though. I don't think this would get you to gravity either, as quantum physics and general relativity are famously incompatible on a fundamental level.

Some thoughts:

  • I don't think you should give a large penalty to inverse square compared to other functions. It's pretty natural once you understand that reality has three dimensions.
  • The conclusion seems pretty reasonable, assuming that the alternate hypotheses are simpler. This is not obvious to me-- Eliezer claims the K complexity of the laws of physics is only ~500 bits. I'm not sure whether Newtonian physics is simpler than relativity once you include the information about electromagnetism contained in the apple.
  • If you have the apple's spectrum, the prob
... (read more)
9titotal1mo
This is a fair point. 1/r2 would definitely be in the "worth considering" category. However, where is the evidence that the gravitational force is varying with distance at all? This is certainly impossible to observe in three frames.  What information? What spectrum? The color information received by the webcam is the total intensity of light when passed through a red filter, the total intensity when passed through a blue filter, and the total intensity when passed through a green filter, at each point. You do not know the frequency of these filters (or that frequency of light is even a thing). I'm sure you could deduce something by playing around with relative intensities and chromatic aberration, but ultimately you cannot build a spectrum with three points.  It depends on what you mean by limited data. All of these observations rely on the extensive body of knowledge and extensive experimentation we have done on earth to figure out the laws of physics that is shared between earth and these outer worlds. 

I don't know how to engage with the first two comments. As for diffusion being slow, you need to argue that it's so slow as to be uncompetitive with replication times of biological life, and that no other mechanism for placing individual atoms / small molecules could achieve better speed and energy efficiency, e.g. this one.

I don't have the expertise to evaluate the comment by Muireall, so I made a Manifold market.

1bhauth1mo
Such actuator design specifics aren't relevant to my point. If you want to move a large distance, powered by energy from a chemical reaction, you have to diffuse to the target point, then use the chemical energy to ratchet the position. That's how kinesin works. A chemical reaction doesn't smoothly provide force along a range of movement. Thus, larger movements per reaction take longer.
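A back-of-envelope illustration of the "larger moves per reaction take longer under diffusion" point; the numbers below (step sizes, diffusion coefficient) are round values I'm assuming for illustration, not measurements for kinesin:

```python
# Back-of-envelope: time to diffuse a distance x scales like x^2 / (2D), so a
# 10x larger step per chemical reaction takes ~100x longer to ratchet.
# Illustrative, assumed numbers only.
D = 1e-11          # m^2/s, rough diffusion coefficient for a protein-scale object in water
for x_nm in (8, 80, 800):                 # step sizes in nanometers
    x = x_nm * 1e-9
    t = x ** 2 / (2 * D)
    print(f"{x_nm:>4} nm step: ~{t:.1e} s per diffusive step")
```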

I'm not sure how to evaluate this, so I made a Manifold market for it. I'd be excited for you to help me edit the market if you endorse slightly different wording.

https://manifold.markets/ThomasKwa/does-thermal-noise-make-drexlerian

I agree and expanded on this in a comment.

3JenniferRM1mo
Voting is, of necessity, pleiotropically [https://en.wikipedia.org/wiki/Pleiotropy] optimized. It loops into reward structures for author motivation, but it also regulates position within default reading suggestion hierarchies for readers seeking educational material, and it also potentially connects to a sense that the content is "agreed to" in some sort of tribal sense.

If someone says something very "important if true and maybe true" that's one possible reason to push the content "UP into attention" rather than DOWN. Another "attentional" reason might be if some content says "the first wrong idea that occurs to nearly everyone, which also has a high quality rebuttal cleanly and saliently attached to it". That is, upvotes can and maybe should flow certain places for reasons of active value-of-information [https://en.wikipedia.org/wiki/Value_of_information] and/or pedagogy [https://www.lesswrong.com/tag/distillation-and-pedagogy/]. Probably there are other reasons, as well! 😉

A) As high-quality highly-upvoted rebuttals like Mr Kwa's have arrived, I've personally been thinking that maybe I should reverse my initial downvote, which would make this jump even higher. I'm a very unusual voter, but I've explained my (tentative) theories of upvoting once or twice, and some people might have started to copy me.

B) I could imagine some voters were hoping (as I might if I thought about it some more and changed my mind on what my voting policy should be in very small ways) to somehow inspire some good rebuttals, by pre-emptively upvoting things in high VoI areas where LW simply hasn't had much discussion lately?

C) An alternative explanation is of course that a lot of LW voters haven't actually looked at nanotech very much, and don't have good independent object level takes, and just agreed with the OP because they don't know any better and it seemed plausible and well written. (This seems the most likely to me, fwiw.)

D) Another possibility is, of course, that there

Not an expert in chemistry or biochemistry, but this post seems to basically not engage with the feasibility studies Drexler has made in Nanosystems, and makes a bunch of assertions without justification, including where Nanosystems has counterarguments. I wish more commenters would engage on the object level because I really don't have the background to, and even I see a bunch of objections. Nevertheless I'll make an attempt. I encourage OP and others to correct me where I am ignorant of some established science.

Points 1, 2, 3, 4 are not relevant to Drexl... (read more)

1bhauth1mo
No, that does not follow. ...for one thing, that's not airtight. No, the steps happen by diffusion so they become slower. That's why slower muscles are more efficient. see this reply [https://www.lesswrong.com/posts/FijbeqdovkgAusGgz/grey-goo-is-unlikely?commentId=t3f5oK9XH6eX2aCNf]

I'm also assuming V is not bounded above.

I'm planning to write a post called "Heavy-tailed error implies hackable proxy". The idea is that when you care about V and are optimizing for a proxy U = X + V, Goodhart's Law sometimes implies that optimizing hard enough for U causes V to stop increasing. (A quick simulation sketch of this effect follows the list below.)

A large part of the post would be proofs about what the distributions of X and V must be for lim_{t → ∞} E[V | X + V ≥ t] = 0, where X and V are independent random variables with mean zero. It's clear that

  • X must be heavy-tailed (or long-tailed or som
... (read more)
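To make the contrast in the bullet list above concrete, here is a quick Monte Carlo sketch (my own illustration; the Student-t(2) choice of heavy tail is an arbitrary assumption) of how E[V | X + V ≥ t] behaves as the selection threshold t grows:

```python
# Monte Carlo sketch of E[V | X + V >= t] for light- vs heavy-tailed error X.
import numpy as np
rng = np.random.default_rng(0)

n = 2_000_000
V = rng.normal(size=n)                      # true value, mean zero

def conditional_value(X, thresholds):
    U = X + V                               # proxy = error + value
    return [V[U >= t].mean() for t in thresholds]

ts = [0, 1, 2, 3, 4, 5]
print(conditional_value(rng.normal(size=n), ts))            # light-tailed error: rises roughly like t/2
print(conditional_value(rng.standard_t(df=2, size=n), ts))  # heavy-tailed error: stays far lower,
                                                            # tending back toward 0 as t grows
```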
3Arthur Conmy1mo
Is bullet point one true, or is there a condition that I'm not assuming? E.g. if V is the constant 0 random variable and X is N(0, 1) then the limit result holds, but a Gaussian is neither heavy- nor long-tailed [https://en.wikipedia.org/wiki/Heavy-tailed_distribution#Definitions].
4leogao1mo
Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when it's definitely not heavy-tailed it's monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17 [https://arxiv.org/pdf/2210.10760.pdf#page=17]). Jacob probably has more detailed takes on this than me. In any event my intuition is this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent

Belrose et al found that the tuned lens is generally superior to the logit lens. Would the results change if the tuned lens were used here? My guess is probably not, since in the paper there is little difference when applying the two techniques to later layers, but maybe it's worth a try.
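For context on what the logit-lens readout being discussed does, here is a rough sketch using TransformerLens (assuming its standard "resid_post" hook naming; this is my illustration, not the post's code):

```python
# Rough logit-lens sketch: decode an intermediate residual stream through the
# final LayerNorm and unembedding to see what an early layer "predicts".
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The Eiffel Tower is in the city of")

layer = 6
resid = cache["resid_post", layer]              # [batch, pos, d_model]
layer_logits = model.unembed(model.ln_final(resid))
top_tokens = layer_logits[0, -1].topk(5).indices
print(model.to_str_tokens(top_tokens))          # layer 6's top next-token guesses
```

The tuned lens replaces this fixed readout with a learned affine probe per layer, which is why it tends to track the model's final predictions more faithfully in early layers.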

1Curt Tigges2mo
Yes, tuned lens is an excellent tool and generally superior to the original logit lens. In this particular case, I don't think it would show very different results, however (and in any case the logit lens is only a small part of the analysis), but I think it would be interesting to have some kind of integration with TransformerLens that enabled the training and usage of tuned lens as well.

In future posts, we will describe a more complete categorisation of these situations and how they relate to the AI alignment problem. 

Did this ever happen?

1Jeremy Gillen2mo
No, Justin knows roughly the content for the intended future posts but after getting started writing I didn't feel like I understood it well enough to distill it properly and I lost motivation, and since then I became too busy. I'll send you the notes that we had after Justin explained his ideas to me.

After talking to Eliezer, I now have a better sense of the generator of this list. It now seems pretty good and non-arbitrary, although there is still a large element of taste.

-2Cedar3mo
Ty 4 the catch. Used chatgpt to generate the html and i think when i asked it to add the CSS, it didn't have enough characters to give me everything.
4jmh3mo
That would be Zebra for those interested.

Suppose an agent has this altruistic empowerment objective, and the problem of getting an objective into the agent has been solved.

Wouldn't it be maximized by forcing the human to sit in front of a box that encrypts their actions and uses the resulting stream to determine the fate of the universe? Then the human would be maximally "in control" of the universe but unlikely to create a universe that's good by human preferences.

I think this reflects two problems:

  • Most injective functions from human actions to world-states are not "human
... (read more)

I'm offering a $300 bounty to anyone that gets 100 karma doing this this year (without any vote manipulation).

Manifold market for this:

They also separately believe that by the time an AI reaches superintelligence, it will in fact have oriented itself around a particular goal and have something like a goal slot in its cognition - but at that point, it won't let us touch it, so the problem becomes that we can't put our own objective into it.

My guess is this is a bit stronger than what Nate believes. The corresponding quote (emphasis mine) is

Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a b

... (read more)

Feynman once challenged people to come up with a problem that could be stated quickly but that he couldn't solve to within 10% in a minute, and a colleague stumped him with finding tan(10^100).

I like this point, but downvoted this because it didn't reference prior work.

2the gears to ascension4mo
Somewhat reasonable, but I'd argue that it's preferable to do the citations lit review as a commenter. eg, I'll give it a shot.

Even if it has some merits, I find the "death with dignity" thing an unhelpful, mathematically flawed, and potentially emotionally damaging way to relate to the problem. Even if MIRI has not given up, I wouldn't be surprised if the general attitude of despair has substantially harmed the quality of MIRI research. Since I started as a contractor for MIRI in September, I've deliberately tried to avoid absorbing this emotional frame, and rather tried to focus on doing my job, which should be about computer science research. We'll see if this causes me problems.

I made a Manifold market for some key claims in this post:

Here's how I think about it: Capable agents will be able to do consequentialist reasoning, but the shard-theory-inspired hypothesis is that running the consequences through your world-model is harder / less accessible / less likely than just letting your shards vote on it. If you've been specifically taught that chocolate is bad for dogs, maybe this is a bad example.

I also wasn't trying to think about whether shards are subagents; this came out of a discussion on finding the simplest possible shard theory hypotheses and applying them to gridworlds.

FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.

6Ben Pace4mo
Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.

What was the equation for research progress referenced in Ars Longa, Vita Brevis?

“Then we will talk this over, though rightfully it should be an equation. The first term is the speed at which a student can absorb already-discovered architectural knowledge. The second term is the speed at which a master can discover new knowledge. The third term represents the degree to which one must already be on the frontier of knowledge to make new discoveries; at zero, everyone discovers equally regardless of what they already know; at one, one must have mastered every

... (read more)
2gwern4mo
I don't think Scott had a specific concrete equation in mind. (I don't know of any myself, and Scott would likely have referenced or written it up on SSC/ACX by now if he had one in mind.) However, conceptually, it's just a variation on the rocket equation [https://en.wikipedia.org/wiki/Tsiolkovsky_rocket_equation] or jeep problem [https://en.wikipedia.org/wiki/Jeep_problem], I think.

A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?

2johnswentworth5mo
I don't think this post is aimed at the same concept(s).

We're definitely unlucky that, of the two challenges, this has been solved and AI strategy is unsolved.

There's a trivial sense in which the agent is optimizing the world and you can rationalize a utility function from that, but I think an agent that, from our perspective, basically just maximizes granite spheres can look quite different from the simple picture of an agent that always picks the top action according to some (not necessarily explicit) granite-sphere valuation of the actions, in ways such that the argument still goes through.

  • The agent can have all the biases humans do.
  • The agent can violate VNM axioms in any other way that doesn't ruin it, basic
... (read more)

Here's one factor that might push against the value of Steinhardt's post as something to send to ML researchers: perhaps it is not arguing for anything controversial, and so is easier to defend convincingly. Steinhardt doesn't explicitly make any claim about the possibility of existential risk, and barely mentions alignment. Gates spends the entire talk on alignment and existential risk, and might avoid being too speculative because their talk is about a survey of basically the same ML researcher population as the audience, and so can engage with the most ... (read more)

6Kaj_Sotala5mo
On the other hand, there's something to be said about introducing an argument in ways that are as maximally uncontroversial as possible, so that they smoothly fit into a person's existing views but start to imply things that the person hasn't considered yet.  If something like the Steinhardt posts gets researchers thinking about related topics by themselves, then that might get them to a place where they're more receptive to the x-risk arguments a few months or a year later - or even end up reinventing those arguments themselves. I once saw a comment that went along the lines of "you can't choose what conclusions people reach, but you can influence which topics they spend their time thinking about". It might be more useful to get people thinking about alignment topics in general, than to immediately sell them on x-risk specifically. (Edited to add: not to mention that trying to get people thinking about a topic, is better epistemics than trying to get them to accept your conclusion directly.) 

+1 to this, I feel like an important question to ask is "how much did this change your mind?". I would probably swap the agree/disagree question for this?

I think the qualitative comments also bear this out as well:

dislike of a focus on existential risks or an emphasis on fears, a desire to be “realistic” and not “speculative”

This seems like people like AGI Safety arguments that don't really cover AGI Safety concerns! I.e. the problem researchers have isn't so much with the presentation as with the content itself.

I agree with the following caveats:

  • I think you're being unfair to that Rob tweet and the MIRI position; having enough goal-directedness to maximize the number of granite spheres + no special structure to reward humans is a far weaker assumption than utility maximization. The argument in the tweet also goes through if the AI has 1000 goals as alien as maximizing granite spheres, which I would guess Rob thinks is more realistic. (note that I haven't talked to him and definitely don't speak for him or MIRI)
  • Shard theory is mostly just a frame and hasn't discov
... (read more)
8TurnTrout5mo
As an aside: If one thinks 1000 goals is more realistic, then I think it's better to start communicating using examples like that, instead of "single goal" examples. (I myself lazily default to "paperclips" to communicate AGI risk quickly to laypeople, so I am critiquing myself to some extent as well.) Anyways, on your read, how is "maximize X-quantity" different from "max EU where utility is linearly increasing in granite spheres"?
3DragonGod5mo
1. Yeah, I think that's fair. I may have pattern matched/jumped to conclusions too eagerly. Or rather, I've been convinced that my allegation is not very fair. But mostly, the Rob tweet provided the impetus for me to synthesise/dump all my issues with EU maximisation. I think the complaint can stand on its own, even if Rob wasn't quite staking the position I thought he was.

That said, I do think that multi-objective optimisation is way more existentially safe than optimising for a single simple objective. I don't actually think the danger directly translates. And I think it's unlikely that multi-objective optimisers would not care about humans or other agents. I suspect the value shard formation hypotheses would imply instrumental convergence towards developing some form of morality. Cooperation is game theoretically optimal. Though it's not clear yet how accurate the value shard formation hypothesis is.

2. I'm not relying too heavily on Shard Theory, I don't think. I mostly cited it because it's what actually led me in that direction, not because I fully endorse it. The only shard theory claims I rely on are:
  • Values are contextual influences on decision making
  • Reward is not the optimisation target

Do you think the first is "non obvious"?

I feel like FTX is a point against utilitarianism for the same reasons Bentham is a point for utilitarianism. If you take an ethical system to logical conclusions and anticipate feminism, animal rights, etc. this is evidence for a certain algorithm creating good in practice. If you commit massive fraud this is evidence against.

This also doesn't shift my meta-ethics much, so maybe I'm not one of the people you're talking about?

2DirectedEvolution6mo
It seems like the explanation is that the desire to conform to the imagined norm in order to reinforce identity is so powerful that it can even override loss aversion to a large degree.

Hypothesis: much of this is explained by the simpler phenomenon of loss aversion. $1 to your ingroup is a gain, $1 to your outgroup is a loss and therefore mentally multiplied by ~2. The paper finds a factor of 3, so maybe there's something else going on too.

5DirectedEvolution6mo
I thought about that, but I think it doesn't quite fit the details of the study. For example, in Study 1, they asked people to choose between two options:
1. Give opponents $1, no effect on you.
2. Your side loses $1, no effect on opponents.
The second option was much more popular, even though it involved taking a loss. So it seems to me that, if anything, loss aversion makes these results even more surprising. What do you think?

Not Nate or a military historian, but to me it seems pretty likely that an actor ~100 human-years more technologically advanced could get a decisive strategic advantage over the world.

  • In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today as well with command-and-control and various other capabilities
  • I would guess pure fusion weapons are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium
  • Currently on the cutting edge, the most a
... (read more)

There's a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.
