All of Vladimir_Nesov's Comments + Replies

More details on CoEm currently seem to be scattered across various podcasts with Connor Leahy, though a writeup might eventually materialize. I like this snippet (4 minutes, starting at 49:21).

A new kind of thing often only finds its natural role once it becomes instantiated as many tiny gears in a vast machine, and people get experience with various designs of the machines that make use of it. Calling an arrangement of LLM calls a "Scaffolded LLM" is like calling a computer program running on an OS a "Scaffolded system call". A program is not primarily about system calls it uses to communicate with the OS, and a "Scaffolded LLM" is not primarily about LLMs it uses to implement many of its subroutines. It's more of a legible/interpretable/debugg... (read more)

3Stephen Fowler21h
Thank you for the feedback. I'm definitely not sold on any particular terminology and was just aiming to keep things as compatible as possible with existing work.  I wasn't that familiar with Conjecture's work on CoEm, although I had read that outline. It was not immediately obvious to me that their work involved LLMs. 

A utility function represents preference elicited in a large collection of situations, each a separate choice between events that happens with incomplete information, as an event is not a particular point. This preference needs to be consistent across different situations to be representable by expected utility of a single utility function.

Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for expected utility frame, and that's... (read more)

Scott Alexander post that seems very relevant to your example: The Control Group Is Out Of Control. It puts into question even the heuristic of "Is there much more evidence for [blah] than...".

Yeah, I thought to note that in the comment that starts this thread; that's not the kind of thing that seems practical when coordinating updating in an informal way. So more carefully, the intended scope of the comment is formal updating (computing of credences) that's directed informally (choosing the potential observations and hypotheses to pay attention to).

As I disclaimed, the frame of the post does rule out relevance of this point, it's not a response to the post's interpretation that has any centrality. I'm more complaining about the background implication that rewards are good (this is not about happiness specifically). Just because natural selection put a circuit in my mind, doesn't mean I prefer to follow its instruction, either in ways that natural selection intended, or in ways that it didn't. Human misalignment relative to natural selection doesn't need to go along with rewards at all, let alone seek... (read more)

Sure, but that's not about formal-ish updating that frames this post, where you are writing down likelihood ratios and computing credences.

3simon2d
Fair enough. (though...really you could in principle still handle filtered evidence in a formalish way. It just would require a bunch of additional complication regarding your priors and evidence on how the filter operates).

We can consider whatever, there is no fundamental duty to only think in particular ways. The useful constraints are on declaring something a claim of fact, not muddying epistemic commons or damaging decision relevant considerations; and in large quantities, on what makes terrible training data for the brain, damaging the aspects with known good properties. Everything else is work in progress, with boundaries impossible to codify while remaining on human level.

Some thinking processes seem to be more useful for arriving at true or useful results; paying atte... (read more)

What if you had a button that you could press to make other people happy?

Ignoring the frame of the post, which assumes some respect for boundaries, there is the following point about the statement taken on its own. Happiness is a source of reward, and rewards rewire the mind. There is nothing inherently good about it, even systematic pursuit of a reward (while you are being watched) is compatible with not valuing the thing being pursued.

I wouldn't want my mind rewired according to some process I don't endorse, by default it's like brain damage, not some... (read more)

3Spiarrow2d
Thanks for your comment, I think it raises an important point if I understood it correctly. But I'm not sure if I have understood it correctly. Are you saying that by doing random things that make other people happy, I would be messing with their reward function? So that I would, for example, reward and thus incentivise random other things the person doesn't really value? In writing this, I had indeed assumed that while happiness is probably not the only valuable thing and we wouldn't want to hook everybody up to a happiness machine, the marginal bit of happiness in our world would be positive and quite harmless. But maybe superstimuli are a counterexample to that? I have to think about it more.

Given real aliens, they would need to either have capped tech or actively trolling to explain even low quality observations or pieces of craft. Nonintervention laws and incorrigible global anti-high-tech supervision constraining aliens are somewhat plausible, coordinated trolling less so.

2awg2d
Given real aliens, how can you be sure of making any claims at all about their civilization/technology/culture/anything without having the sort of observational evidence that would be necessary to make such claims?  We're in Cartesian Demon territory when discussing these theoretical others. We can plop our human notions on top of them all we want, but unless we have direct, observable evidence of the way "they" think/operate/whatever, we can just as easily assume any given conclusion about them as just as likely as any other. And that includes all the N conclusions we haven't even thought of (or simply can't conceive of due to our necessarily human viewpoint). It seems wildly overconfident to make any claims about them at all that aren't completely hypothetical in the way you describe in your other reply here. Your idea that they either have to have capped tech or be actively trolling is itself just a hypothesis at best, and an idea at worst.

This is not about alien aircraft, this is just a completely wrong way to approach updating. The set of observations/experiments being evaluated is filtered by what was actually observed and by the narrative around the hypothesis (which is in turn not independent from what was actually observed). There are other potential observations that didn't happen, and that fact is also evidence, and yet more observations that did happen but aren't genre-appopriate. By not updating on these other potential observations, the evidence is heavily filtered, and so updatin... (read more)

1simon2d
Agreed that paying attention to how evidence is filtered is super important. But, in principle, you can still derive conclusions from filtered evidence. It's just really hard, especially if the filter is strong and hard to characterize (as is the case with UAPs).

Prediction/compression seems to be working out as a path to general intelligence, implicitly representing situations in terms of their key legible features, making it easy to formulate policies appropriate for a wide variety of instrumental objectives, in a wide variety of situations, without having to adapt the representation for particular kinds of objectives or situations. To the extent brains engage in predictive processing, they are plausibly going to compute related representations. (This doesn't ensure alignment, as there are many different ways of making use of these features, of acting differently in the same world.)

3Bogdan Ionut Cirstea5d
Yes, predictive processing as the reason behind related representations has been the interpretation in a few papers, e.g. The neural architecture of language: Integrative modeling converges on predictive processing [https://www.pnas.org/doi/10.1073/pnas.2105646118]. There's also some pushback against this interpretation though, e.g. Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data [https://direct.mit.edu/nol/article/doi/10.1162/nol_a_00087/113632/Predictive-Coding-or-Just-Feature-Discovery-An].

When communicating an argument, the quality of feedback about its correctness you get depends on efforts around its delivery whose shape doesn't depend on its correctness. The objective of improving quality of feedback in order to better learn from it is a check on spreading nonsense.

"my priors against aliens are high" -> "no aliens" -> "no need to do anything"

More carefully, value of information is about how credences change in response to outcomes of feasible things that can be done, not about what the credences are a priori.

more useful to simply take the stance "I don't know"

if your beliefs are coherent they imply an underlying probabilistic model

Decision making motivates having some way of conditioning beliefs on influence at given points of intervention, that's how you estimate the consequences of those interventions and their desirability. To take a stance of "don't know" seems to me analogous to considering how a world model varies with (depends on) the thing you don't know, how it depends on what it turns out to be, or on what the credences about the facts surrounding it are.

With SSL pre-trained foundation models, the interesting thing is that the embeddings computed by them are useful for many purposes, while the models are not trained with any particular purpose in mind. Their role is analogous to beliefs, the map of the world, epistemic side of agency, convergently useful choice of representation/compression that by its epistemic nature is adequate for many applied purposes.

Not sure how the h-morality vs non-h-morality is related to affect though.

This point is in the context of the linked post; a clearer test case is the opposition between p-primeness and primeness. Pebblesorters care about primeness, while p-primeness is whatever a peblesorter would care about. The former is meaningful, while the latter is vacuously circular as guidance/justification for a pebblesorter. Likewise, advising a human to care about whatever a human would care about (h-rightness) is vacuously circular and no guidance at all.

In the implied analo... (read more)

That's one of the points I was making. The agent could be making decisions without needing something affect-like to channel preference, so the fixation on affect doesn't seem grounded in either normative or pragmatic decision making to begin with.

Also, the converse of installing capacity to suffer is getting rid of it, and linking it to legal rights creates dubious incentive to keep it. Affect might play a causal role in finding rightness, but rightness is not justified by being the thing channeled in a particular way. There is nothing compelling about h-rightness, just rightness.

2shminux8d
Right, if the affect capability is not fixed, and in retrospect it rarely is, then focusing on it as a metric means it gets Goodharted if the optimization pressure is strong enough. Which sometimes could be a good thing [https://www.lesswrong.com/posts/a4HzwhvoH7zZEw4vZ/wirehead-your-chickens]. Not sure how the h-morality vs non-h-morality is related to affect though.

Preference/endorsement that is decision relevant on reflection is not about affect. Ability to self-modify to install capacity to suffer because it's a legal requirement also makes the criterion silly in practice.

2shminux8d
Hmm, I guess what you are saying is that if an agent has goals that require external protection through obtaining legal rights, and the only way to do it is to have the capacity to suffer, then the agent would be compelled to learn suffering. Is that right?

I don't think the framing is appropriate, because rights set up the rules of the game built around what is right, or else boundaries against intrusion and manipulation, and there is no reason to single out suffering in particular.

But within the framing that pays attention to suffering, the meaning of capacity to suffer is unclear. I mostly don't suffer in actual experience. Any capacity to suffer would need elicitation in hypothetical events that put me in that condition, modifying my experience of actuality in a way I wouldn't endorse. This doesn't seem i... (read more)

I believe @shminux's perspective aligns with a significant school of thought in philosophy and ethics that rights are indeed associated with the capacity to suffer. This view, often associated with philosopher Jeremy Bentham, posits that the capacity for suffering rather than rationality or intelligence, should be the benchmark for rights.

 

“The question is not, Can they reason?, nor Can they talk? but, Can they suffer? Why should the law refuse its protection to any sensitive being?” – Bentham (1789) – An Introduction to the Principles of Morals and L... (read more)

2shminux8d
It seems like I am missing some of your frame here. My initial point was that an entity that is not capable of suffering (negative affect?) does not need to be protected from it. That point seems self-evident to me, but apparently it is not self-evident to you or others?
2shminux9d
I do not disagree, my point is about the capacity to suffer while alive. Unless I am missing your point.

This is a well-known hypothetical. What goes with it is remaining possibility of de novo creation of additional AGIs that either have architecture particularly suited for self-aligned self-improvement (with whatever values make it tractable), or of AGIs that ignore the alignment issue and pursue the task of capability improvement heedless of resulting value drift. Already having an AGI in the world doesn't automatically rule out creation of more AGIs with different values and architectures, it only makes it easier.

Humans will definitely do this, using all ... (read more)

It's a distinction between these different futures. The present that ends in everyone of Earth dying is clearly different from both, but the present literally everlasting is hopefully not a consideration.

1O O11d
I’m just trying to understand the biggest doomers. I feel like disempowerment is probably hard to avoid. However I don’t think a disempowered future with bountiful lives would be terrible depending on how tiny the kindness weight is/how off it is from us. We are 1/10^53 of the observable universe’s resources. Unless alignment is wildly off base, I see AI directed extinction as unlikely. I fail to see why even figures like Paul Christiano peg it at such a high level, unless he estimates human directed extinction risks to be high. It seems quite easy to create a plague that wipes out humans and a spiteful individual can do it, probably more likely than an extremely catastrophically misaligned AI.

There are two importantly different senses of disempowerment. The stars could be taken out of reach, forever, but human civilization develops in its own direction. Alternatively, human civilization is molded according to AIs' aesthetics, there are interventions that manipulate.

1O O11d
Is there a huge reason the latter is hugely different from the former for the average person excluding world leaders.

Pieces of vehicles given general stealthy attitude imply capped technology. With sufficiently robust alien psychology, this could mean a Dune regime, in which case aliens would need to go out of hiding or less deniably start derailing human AGI projects. Alternatively, there is a non-corrigible anti-foom alien pivotal process AGI watchdog that keeps the tech below some level, which could itself be superintelligent but specialized for this bounded task instead of doing world optimization. In this case the pieces of vehicles are from the aliens in its care, ... (read more)

2avturchin11d
Anthropics arguments seem to support this: we can observe only those parts of the universe where we a) first - or b) grabby aliens are not destroying everything. 

Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.

Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/bou... (read more)

When you don't model your human counterparty's mind anyway, it doesn't matter if they comprehend decision theory. The whole point of delegating to bots is that only understanding of bots by bots remains necessary after that. If your human counterparty doesn't understand decision theory, they might submit a foolish bot, while your understanding of decision theory earns you a pile of utility.

So while the motivation for designing and setting up an arena in a particular way might be in decision theory, the use of the arena doesn't require this understanding of the human users, and yet it can shape incentives in a way that defeats bad equilibria of classical game theory.

PrudentBot's counterparty is another program intended to be legible, not a human. The point is that in practice it's not necessary to model any humans, humans can delegate legibility to programs they submit as their representatives. It's a popular meme that humans are incapable of performing Löbian cooperation, because they can't model each other's messy minds, that only AIs could make their own thinking legible to each other, granting them unique powers of coordination. This is not the case.

if it were happening in real life and not a simulated game

Pro... (read more)

2Max H14d
Right, but my point is that it's still necessary for something to model something. The bot arena setup in the paper has been carefully arranged so that the modelling is in the bots, the legibility is in the setup, and the decision theory comprehension is in the author's brains.  I claim that all three of these components are necessary for robust cooperation, along with some clever system design work to make each component separable and realizable (e.g. it would be much harder to have the modelling happen in the researcher brains and the decision theory comprehension happen in the bots). Two humans, locked in a room together, facing a true PD, without access to computers or an arena or an adjudicator, cannot necessarily robustly cooperate with each other for decision theoretic reasons, even if they both understand decision theory.

without actually being capable of performing the counterparty modeling, legibility, and other cognitive work necessary to implement that decision theory to any degree of faithfulness

This is not needed, you can just submit PrudentBot as your champion for a given interaction, committing to respect the adjudication of an arena that has the champions submitted by yourself and your counterparty. The only legibility that's required is the precommitment to respect adjudication of the arena, which in some settings can be taken out of players' hands by construction.

2Max H14d
PrudentBot is modelling its counterparty, and the setup in which it runs is what makes the modelling and legibility possible. To make PrudentBot work, the comprehension of decision theory, counterparty modelling, and legibility are all required. It's just that these elements are spread out, in various ways, between (a) the minds of the researchers who created the bots (b) the source code of the bots themselves (c) the setup / testbed that makes it possible for the bots to faithfully exchange source code with each other. Also, arenas where you can submit a simple program are kind of toy examples - if you're facing a real, high-stakes prisoner's dilemma and you can set things up such that you can just have some programs make the decisions for you, you're probably already capable of coordinating and cooperating with your counterparty sufficiently well that you could just avoid the prisoner's dilemma entirely, if it were happening in real life and not a simulated game.  

I see. Referring back to your argument was more an illustration of existence for this motivation. If a society forms around the motivation, at any one time in the billion years, and selects for intelligence to enable nontrivial long term institution design, that seems sufficient to escape stasis.

There is truth or calibrated credence or knowing what "good" means or carefully optimizing goodness. Then there are methods that are more or less effective at helping with attaining these things. If you happen to be practicing the better methods, then to the extent they really are effective, you become better at finding truth or calibrated credence or at developing goodness.

And then there is rationality, which is aspiration towards those methods that are better at this. Practicing good methods is sufficient to get results, if the methods actually exist and... (read more)

it's not inherently "smart" to sacrifice those significantly for the sake of a long term project

Your argument was that this hopeless trap might happen after a catastrophe and it's so terrible that maybe it's as bad or worse as everyone dying quickly. If it's so terrible, in any decision-relevant sense, then it's also smart to plot towards projects that dig humanity out of the trap.

3dr_s15d
No, sorry, I may have conveyed that wrong and mixed up two arguments. I don't think stasis is straight up worse than extinction. For good or bad, people lived in the Middle Ages too. My point was more that if your guiding principle is "can we recover", then there are more things than extinction to worry about. If you aspire at some kind of future in which humans grow exponentially then you won't get it if we're knocked back to preindustrial levels and can't recover. I don't personally think that's a great metric or goal to adopt, just following the logic to its endpoint. And I also expect that many smart people in the stasis wouldn't plot with only that sort of long term benefit in mind. They'd seek relatively short term returns.

Intelligence is also a thing that enables perceiving returns that are not immediate, as well as maintenance of more complicated institutions that align current incentives towards long term goals.

1dr_s16d
This isn't a simple marshmallow challenge scenario. If you have a society that has needs and limited resources, it's not inherently "smart" to sacrifice those significantly for the sake of a long term project that might e.g. not benefit anyone who's currently living. It's a difference in values at that point; even if you're smart enough you can still not believe it right. For example, suppose in 1860 everyone knew and accepted global warming as a risk. Should they, or would they, have stopped using coal and natural gas in order to save us this problem? Even if it meant lesser living standards for themselves, and possibly more death?

there's fair odds that if we're knocked back into the millions by a pandemic or nuclear war now we may never pick ourselves back again

Humanity went from Göbekli Tepe to today in 11K years. I doubt even after forgetting all modern learning, it would take even a million years to generate knowledge and technologies for new circumstances. I hear the biosphere can last about a billion years more. (One specific path is to use low-tech animal husbandry to produce smarter humans. This might even solve AI x-risk by making humanity saner.)

1dr_s16d
I disagree it's that easy. It's not a long trajectory of inevitability; like with evolution, there are constraints. Each step generally has to be on its own aligned with economic incentives at the time. See how for example steam power was first developed to fuel pumps removing water from coal mines; the engines were so inefficient that it was only cost effective if you didn't also need to transport the coal. Now we've used up all surface coal and oil, not to mention screwed up the climate quite a bit for the next few millennia, conditions are different. I think technology is less uniform progression and more a mix of "easy" and "hard" events (as in the grabby aliens paper, if you've read it), and by exhausting those resources we've made things harder. I don't think climbing back up would be guaranteed. This IMO even if it was possible would solve nothing while potentually causing an inordinate amount of suffering. And it's also one of those super long term investments that don't align with almost any incentive on the short term. I say it solves nothing because intelligence wouldn't be the bottleneck; if they had any books left lying around they'd have a road map to tech, and I really don't think we've missed some obvious low tech trick that would be relevant to them. The problem is having the materials to do those things and having immediate returns.

Prompted LLM AI personalities are fictional, in the sense that hallucinations are fictional facts. An alignment technique that opposes hallucinations sufficiently well might be able to promote more human-like (non-fictional) masks.

And then there is the Bayesian Mindset, aspiring to put probabilities and utilities on all things.

"Extinction from AI" really doesn't refer to deepfakes, AI leaving us nothing to do, and algorithmic bias. It doesn't include any of these categories. There is nothing correct about interpreting "extinction from AI" as referring to either of those things. This holds even if extinction from AI is absolutely impossible, and those other things are both real/imminent and extremely important. Words have meaning even when they refer to fictional ideas.

1dr_s16d
Deepfakes as in "hey, my uncle posted an AI generated video of Biden eating a baby on FB", no (though that doesn't help our readiness as a species). The general ability of AI to deceive, impersonate, and pretty much break any assumption about who we may believe we are talking to though is a prominent detail that often features in extinction scenarios (e.g. how the AI starts making its own money or manipulating humans into producing thing it needs). I would say "extinction scenarios" include everything that features extinction and AI in the event chain, which doesn't even strictly need to be a takeover by agentic AI. Anyway the actual signed statement is very general. I can guess that some of these people don't worry specifically about the "you are made of atoms" scenario, but that's just arguing against something that isn't in the statement.

I agree, amount of humans and a lot of other utilitarian aims is goodharting for bad proxies. The distinction I was gesturing at is not about amount of what happens, but about perception vs. reality. And a million humans is very different from zero anyone, even if the end was not anticipated nor perceived.

1dr_s16d
Ol, let's consider two scenarios: 1. humanity goes extinct gradually and voluntarily via a last generation that simply doesn't want to reproduce and is cared for by robots to the end, so no one suffers particularly in the process; 2. humanity is locked in a torturous future of trillions in inescapable torture, until the heat death of the universe. Which is better? I would say 1 is. There are things worse than extinction (and some of them are on the table with AI too, theoretically). And anyway you should consider that with how many "low hanging fruit" resources we've used, there's fair odds that if we're knocked back into the millions by a pandemic or nuclear war now we may never pick ourselves back again. Stasis is better than immediate extinction but if you care about the long term future it's also bad (and implies a lot more suffering because it's a return to the past).

My point is I don't think they're incorrect.

Misconstruing an incorrect statement with a correct steelman is incorrect. If I say "I've discovered a truly marvelous proof that 2+2=3000 that this margin is too small to contain," and you reply, "Ah, so you are saying 2+2=4, quite right," then the fact of your inexplicable discussion of a different and correct statement doesn't make your interpretation of my original incorrect statement correct.

1dr_s16d
I explained in the rest of the comment why I don't think they're incorrect, literally. The signed statement anyway is just: So that seems to me like it's succinct enough to include all categories discussed here. I don't see the issue.

It's not that complicated. There is a sense in which these claims are objective (even as the words we use to make them are 2-place words), to the same extent as factual claims, both are seen through my own mind and reified as platonic models. Though morality is an entity that wouldn't be channeled in the physical world without people, it actually is channeled, the same as the Moon actually is occasionally visible in the sky.

as a function of the people who are about to witness it and know they are the last

My point is not about anyone's near term subject... (read more)

2dr_s16d
I really, really don't care about what happens in the distant future compared to what happens now, to humans that actually exist and feel. I especially don't care about there being an arbitrarily high amount of humans. I don't think a trillion humans is any better than a million as long as: 1. they are happy 2. whatever trajectory lead to those numbers didn't include any violent or otherwise painful mass death event, or other torturous state. There really is nothing objective about total sum utilitarianism; and in fact, as far as moral intuitions go, it's not what most people follow at all. With things like "actually death is bad" you can make a very cogent case: people, day to day, usually don't want to die, therefore there never is a "right moment" in which death is not a violation; if there was, people can still commit suicide anyway, thus death by old age or whatever else is just bad. That's a case where you can invoke the "it's not that complicate" argument IMO. Total sum utilitarianism is not; I find it a fairly absurd ethical system, ripe for exploits so ridiculous and consequences so blatantly repugnant that it really isn't very useful at all.

Sure, natural selection would also technically be an AGI by my definition as stated, so there should be subtext of it taking no more than a few years to discover human-without-supercomputers-or-AI theoretical science from the year 3000.

The discussions I've seen have mentioned things like deepfakes, autonomous weapons, designer pathogens, AI leaving us nothing to do, and algorithmic bias.

I honestly think that's for the best because I don't believe super fast takeoff FOOM scenarios are actually realistic.

When a claim is wrong, ignoring its wrongness and replacing it in your own perception with a corrected steelman of completely different literal meaning is not for the best. The sane thing would be to call out the signatories for saying something wildly incorrect, not pretending tha... (read more)

1dr_s16d
My point is I don't think they're incorrect. All those things are ALSO problems, and many are paths to X-risk even, which I'd consider more likely (in a slow takeoff scenario) than FOOM. A few possible scenarios: 1. designer pathogens are the obvious example because there's so many ways they can cause targeted human extinction without risking the AI's integrity in any way, so they're obvious candidates for both a misuse of AI by malicious humans and for a rogue AI that is actively trying to kill us for whatever reason. Plus there's also the related risk of designer microorganisms that alter the Earth's biosphere as a whole, which would be even deadlier (think something like the cyanobacteria that caused the Great Oxidation Event, except now it'd be something that grows out of control and makes the atmosphere deadly) 2. autonomous weapons are obviously dangerous in the hands of an AI because that's really handing it our defenses on a silver platter and makes it turning against us much more likely to succeed (and thus a potentially attractive strategy to seek power) 3. deepfakes or actual even more sophisticated forms of deception could be exploited by an AI to manipulate humans towards carrying out actions that benefit it or allow it to escape confinement and so on. Being able to see through deepfakes and defend against them would be key to the security of important strategic resources 4. we don't know what happens socially and economically if AGI really takes everyone's job. We go into unexplored territory, and more so, we go there from a place where the AGI will probably be owned by a few to begin with. We might hope for glorious post-scarcity abundance but that's not the only road nor, I fear, the most likely. If the transition goes badly it can weaken our species enough that all AGI needs to do to get rid of us for good is give a little push. All these things of course ar

Not every system of values places extinction on its own special pedestal [...] in terms of expected loss of life AI could be even with those other things

Well this is wrong, and I'm not feeling any sympathy for a view that it's not. An eternity of posthuman growth after recovering from a civilization-spanning catastrophe really is much better than lights out, for everyone, forever.

I agree that there are a lot of people who don't see this, and will dismiss a claim that expresses this kind of thing clearly. In mainstream comments to the statement, I've see... (read more)

1dr_s16d
You can't really say anything is objectively wrong when it comes to morals, but also, I generally think that evaluating the well-being of potential entities to be leads to completely nonsensical moral imperatives like the Repugnant Conclusion. Since no one experiences all of the utility at the same time, I think "expected utility probability distribution" is a much more sensible metric (as in, suppose you were born as a random sentient in a given time and place: would you be willing to take the bet?). That said, I do think extinction is worse than just a lot of death, but that's as a function of the people who are about to witness it and know they are the last. In addition, I think omnicide is worse than human extinction alone because I think animals and the rest of life have moral worth too. But I wouldn't blame people for simply considering extinction as 8 billion deaths, which is still A LOT of deaths anyway. It's a small point that's worthless arguing. We have wide enough uncertainties on the probability of these risks anyway that we can't really put fixed numbers to the expect harms, just vague orders of magnitude. While we may describe them as if they were numerical formulas, these evaluations really are mostly qualitative; enough uncertainty makes numbers almost pointless. Suffice to say, I think if someone considers, say, a 5% chance of nuclear war a bigger worry than a 1% chance of AI catastrophe, then I don't think I can make a strong argument for them being dead wrong. I agree this makes no sense, but it's a completely different issue. That said, I think the biggest uncertainty re: X-risk remains whether AGI is really as close as some estimate it is at all. But this aspect is IMO irrelevant when judging the opportunity of actively trying to build AGI. Either it's possible, and then it's dangerous, or it's still way far off, and then it's a waste of money and precious resources and ingenuity.

I think a good definition for AGI is capability for open-ended development, the point where the human side of the research is done, and all it needs to reach superintelligence from that point on is some datacenter maintenance and time, so that eventually it can get arbitrarily capable in any domain it cares for, on its own. This is a threshold relevant for policy and timelines. GPT-4 is below that level (it won't get better without further human research, no matter how much time you give it), and ability to wipe out humans (right away) is unnecessary for reaching this threshold.

2faul_sname16d
I think we also care about how fast it gets arbitrarily capable. Consider a system which finds an approach which can measure approximate actions-in-the-world-Elo (where an entity with an advantage of 200 on their actions-in-the-world-Elo score will choose a better action 76% of the time), but it's using a "mutate and test" method over an exponentially large space, such that the time taken to find the next 100 point gain takes 5x as long, and it starts out with an actions-in-the-world-Elo 1000 points lower than an average human with a 1 week time-to-next-improvement. That hypothetical system is technically a recursively self-improving intelligence that will eventually reach any point of capability, but it's not really one we need to worry that much about unless it finds techniques to dramatically reduce the search space. Like I suspect that GPT-4 is not actually very far from the ability to come up with a fine-tuning strategy for any task you care to give it, and to create a simple directory of fine-tuned models, and to create a prompt which describes to it how to use that directory of fine-tuned models. But fine-tuning seems to take an exponential increase in data for each linear increase in performance [https://platform.openai.com/docs/guides/fine-tuning/prepare-training-data], so that's still not a terribly threatening "AGI".

Our results below show that process supervision in fact incurs a negative alignment tax

Some compelling arguments are given that alignment tax would be negative when this method is used to improve safety. The actual experimental results are about improving/eliciting capabilities and don't explore application of the method for safety, except by drawing an analogy.

a much more ideal thing to be kind towards than current humans

Relevant sense of kindness is towards things that happen to already exist, because they already exist. Not filling some fraction of the universe with expression-of-kindness, brought into existence de novo, that's a different thing.

<1/trillion [kindness]

I expect the notkilleveryone threshold is much lower than that. It takes an astronomically tiny fraction of cosmic endowment to maintain a (post)human civilization that's not too much larger than it currently is. The bigger expenditure would be accomodating humanity at the start, slightly delaying initial self-improvement and expansion from Earth. The cheapest way would be to back up human minds; or if that's too onerous then even merely the generic code and the Internet (which would be completely free; there is the issue that e... (read more)

For purposes of morality or decision making, environments that border membranes are a better building block for scopes of caring than whole (possible) worlds, which traditionally fill this role. So it's not quite a particular bare-bones morality, but more of a shared scaffolding that different agents can paint their moralities on. Agreement on boundaries is a step towards cooperation in terms of scopes of caring delimited by these boundaries. Different environments then get optimized according to different preferences, according to coalitions of agents tha... (read more)

1Chipmonk18d
To be clear, I think the bare-bones morality in that post [https://www.lesswrong.com/posts/KX3xx8LTnE7GKoFuj/boundaries-for-formalizing-a-bare-bones-morality] comes from "observe boundaries and then try not to violate them" (or, in Davidad's case: and then proactively defend them (which is stronger)).  I'll need to think about the rest of your comment more, hm. If you think of examples please lmk:) Also, wdym that a logical dependency could be itself a membrane? Eg? One thing— I think the «membranes/boundaries» generator would probably reject the "you get to choose worlds" premise, for that choice is not within your «membrane/boundary». Instead, there's a more decentralized (albeit much smaller) thing that is within your «membrane» that you do have control over. (Similar example here [https://www.lesswrong.com/posts/rz6XziKYmDqDvEo6u/a-short-critique-of-boundaries-part-2-example-1-expansive] wrt to Alex, global poverty, and his donations.)

I think the use of the frame is in replacing agents by membranes and environments. The only way of interacting with an agent is via their membrane. An agent could be enclosed in multiple nested membranes, and you need to know which membrane you are interacting through, so really you are interacting with a certain membrane, not with a certain agent.

The you that interacts with that membrane lives on one of the sides of it, within an environment bordering the membrane. This you is also presented by a membrane that fits the environment it lives in and borders.... (read more)

1Chipmonk18d
oooo, that's interesting! maximal abstraction… *membranes all the way down...* I do like this, I'll have to think about it more. Another thing this makes me realize that I like about this is that it requires no privileged perspective: you don't actually know that anything is an agent (just like reality— solipism— etc.)— all you know is that there are membranes and things you can't control… I already think that trying to control fate is a membrane/boundary violation just as trying to control another sovereign agent is a membrane/boundary violation, and this would naturally make those the same.   Though, I think if the frame 'agent' were to be abandoned, we'd need a sort of meta-membrane that also accounts for the membrane's ability to leak, break, and get stronger/weaker ? (Eg "If you stab this membrane, it will pop" would be part of the interface of the meta-membrane, not the object-level membrane ?)   Thank you!   Random thoughts/clarification-to-myself about nested membranes: * I agree that, as seen from the outside, agents can overlap in in nested membranes. I exist as a person exist within (I am) my physical and informational membrane, and 'I' also reside (in some ways, though perhaps not all of the same ways) in the membrane of my country as one of its citizens. To the outside, I am nested within the country. One thing I do want to clarify though, is that from my perspective everything is also nested. There's still a membrane that mediates how I interact with my country, for example— even though in some ways I exist within my country.

It's a step, likely one that couldn't be skipped. Still just short of actually acknowledging nontrivial probability of AI-caused human extinction, and the distinction between extinction and lesser global risks, availability of second chances at doing better next time. Nuclear war can't cause extinction, so it's not properly alongside AI x-risk. Engineered pandemics might eventually get extinction-worthy, but even that real risk is less urgent.

3dr_s16d
Eh, I think this is really splitting hairs. I have seen already multiple people using the lack of reference to climate change to dismiss the whole thing. Not every system of values places extinction on its own special pedestal (though I think in this case "biological omnicide" might be more it: unlike pandemics, AI could also kill the rest of non-human life). But in terms of expected loss of life AI could be even with those other things if you consider them more likely.
Load More