# 12

Personal Blog

If it’s worth saying, but not worth its own post, here's a place to put it. (You can also make a shortform post.)

And, if you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are welcome.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the new Concepts section.

The Open Thread tag is here.


Covid19Projections has been one of the most successful coronavirus models in large part because it is as 'model-free' and simple as possible, using ML to backtrack parameters for a simple SEIR model from death data only. This has proved useful because case numbers are skewed by varying numbers of tests, so deaths are more consistently reliable as a metric. You can see the code here.

However, in countries doing a lot of testing, with a reasonable number of cases but with very few deaths, like most of Europe, the model is not that informative, and essentially predicts near 0 deaths out to the limit of its measure. This is expected - the model is optimised for the US.

Estimating SEIR parameters based on deaths works well when you have a lot of deaths to count; if you don't, you need another method. Estimating purely based on cases has its own pitfalls - see this from epidemic forecasting, which mistook an increase in testing in the UK in mid-July for a sharp jump in cases and wrongly inferred a brief jump in R_t. As far as I understand their paper, the estimate of R_t from case data adjusts for delays from infection to onset and for other things, but not for the positivity rate or how good overall testing is.

This isn't surprising - there is no simple model that combines test positivity rate and the number of cases and estimates the actual current number of infections. But perhaps you could use a Covid19pro-like method to learn such a mapping.

Very oversimplified, Covid19pro works like this:

Our COVID-19 prediction model adds the power of artificial intelligence on top of a classic infectious disease model. We developed a simulator based on the SEIR model (Wikipedia) to simulate the COVID-19 epidemic in each region. The parameters/inputs of this simulator are then learned using machine learning techniques that attempts to minimize the error between the projected outputs and the actual results. We utilize daily deaths data reported by each region to forecast future reported deaths. After some additional validation techniques (to minimize a phenomenon called overfitting), we use the learned parameters to simulate the future and make projections.

Here the functions f and g estimate, respectively, the SEIR (susceptible, exposed, infectious, recovered) parameters from deaths up to some time t_0, and the future deaths implied by those parameters. Both functions are then optimised to minimise the error once the actual number of deaths at t_1 is fed into the model.
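To make the roles of f and g concrete, here is a minimal sketch, not the actual Covid19Projections code. It assumes a bare-bones discrete-time SEIR simulator, a plain grid search standing in for the machine learning step, and deaths that occur with no delay as a fixed fraction of removals. All function names, the grid values, and the synthetic data are hypothetical.

```python
def simulate_deaths(r0, ifr, days, population=1_000_000, seed_infections=10,
                    incubation_days=5.0, infectious_days=7.0):
    """Discrete-time SEIR model; returns expected deaths for each day."""
    s = float(population - seed_infections)
    e, i, r = 0.0, float(seed_infections), 0.0
    beta = r0 / infectious_days  # transmission rate implied by R0
    deaths = []
    for _ in range(days):
        new_exposed = beta * s * i / population
        new_infectious = e / incubation_days
        new_removed = i / infectious_days
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        # Assumption: a fixed fraction of removals are deaths, with no delay.
        deaths.append(ifr * new_removed)
    return deaths


def squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def fit_parameters(observed_deaths, candidates_r0, candidates_ifr):
    """Plays the role of f: pick the parameters that best reproduce past deaths."""
    best = None
    for r0 in candidates_r0:
        for ifr in candidates_ifr:
            simulated = simulate_deaths(r0, ifr, len(observed_deaths))
            err = squared_error(simulated, observed_deaths)
            if best is None or err < best[0]:
                best = (err, r0, ifr)
    return best[1], best[2]


def project_deaths(r0, ifr, days_observed, days_ahead):
    """Plays the role of g: run the fitted simulator past the last observed day."""
    return simulate_deaths(r0, ifr, days_observed + days_ahead)[days_observed:]


# Synthetic "reported deaths" generated from known parameters.
true_deaths = simulate_deaths(r0=2.5, ifr=0.01, days=60)
r0_hat, ifr_hat = fit_parameters(true_deaths, [1.5, 2.0, 2.5, 3.0], [0.005, 0.01, 0.02])
forecast = project_deaths(r0_hat, ifr_hat, days_observed=60, days_ahead=14)
```

On real data the fit would be noisy and the parameters would have to vary over time, which is where the validation steps mentioned in the quote come in.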

This oversimplification is deliberate:

Deaths data only: Our model only uses daily deaths data as reported by Johns Hopkins University. Unlike other models, we do not use additional data sources such as cases, testing, mobility, temperature, age distribution, air traffic, etc. While supplementary data sources may be helpful, they can also introduce additional noise and complexity which can notably skew results.

What I suggest is a slight increase in complexity, where we use a similar model except we feed it paired test positivity rate and case data instead of death data. The positivity rate (or tests per case) serves as a 'quality estimate' that tells you how good the test data is. That's how tests per case is treated by Our World in Data. We all know intuitively that if positivity rate is going down but cases are going up, the increase might not be real, but if positivity rate is going up and cases are going up the increase definitely is real.

What I'm suggesting is that we do something like this:

Now, you need to have reliable data on the number of people tested each week, but most of Europe has that. If you can learn a model that gives you a more accurate estimate of the SEIR parameters from combined cases and tests/case data, then it should be better at predicting future infections. It won't necessarily predict future cases, since the number of future cases also depends on the number of tests conducted, which is subject to all sorts of random fluctuations that we don't care about when modelling disease transmission. Instead, you could use the same loss function as the original Covid19pro: minimizing the difference between projected and actual deaths.
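As a rough illustration of learning such a mapping with a deaths-based loss, the sketch below assumes a purely hypothetical one-parameter correction: estimated infections are reported cases scaled by a power `alpha` of the positivity rate relative to a reference rate. The power-law form, the fixed fatality rate, the fixed lag between infection and death, and the synthetic data are all assumptions made for the example, not anything taken from Covid19Projections or Our World in Data.

```python
def estimate_infections(cases, positivity, alpha, reference_positivity=0.01):
    """Hypothetical mapping: higher positivity means more infections were missed."""
    return [c * (p / reference_positivity) ** alpha
            for c, p in zip(cases, positivity)]


def death_loss(alpha, cases, positivity, observed_deaths, ifr, lag):
    """Squared error between deaths implied by the infection estimate and reported deaths."""
    infections = estimate_infections(cases, positivity, alpha)
    predicted = [ifr * x for x in infections]  # deaths expected `lag` days later
    target = observed_deaths[lag:]             # align reported deaths with infections
    return sum((p - d) ** 2 for p, d in zip(predicted, target))


def fit_alpha(cases, positivity, observed_deaths, ifr=0.01, lag=3,
              grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search standing in for the learning step."""
    return min(grid, key=lambda a: death_loss(a, cases, positivity,
                                              observed_deaths, ifr, lag))


# Synthetic data: reported cases undercount infections more when positivity is high.
true_infections = [1000, 1200, 1500, 1900, 2400, 3000, 3700, 4500]
positivity = [0.02, 0.03, 0.02, 0.05, 0.04, 0.08, 0.06, 0.10]
cases = [x / (p / 0.01) ** 0.5 for x, p in zip(true_infections, positivity)]
lag = 3
deaths = [0.0] * lag + [0.01 * x for x in true_infections]

alpha_hat = fit_alpha(cases, positivity, deaths, ifr=0.01, lag=lag)
```

Note that the loss never looks at future case counts, only at deaths, so fluctuations in testing volume enter only through the positivity input.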

Hopefully the intuition that you can learn more from the pair (tests/case, number of cases) than number of cases or number of deaths alone should be borne out, and a c19pro-like model could be trained to make high quality predictions in places with few deaths using such paired data. You would still need some deaths for the loss function and fitting the model.

Greetings all, and thanks for having me! :) I'm an AI enthusiast based in Hamilton, NZ, where until recently I was enrolled in and studying strategic management and computer science, specifically 'AI technical strategy'. After coronavirus and everything that's been happening in the world, I've moved away from formal studies and am now focusing on using my skills in a more interactive and 'messy' way, which means more time online with groups like LessWrong. :) I've been interested in rationality and the art of dialogue since the early 2000s. I've been involved in startups and AI projects from a commercial perspective for a while, specifically in the agri-tech space. I would like to better understand and appreciate forums like this, where the technology essentially enables better and more productive human interaction.

Welcome!

Is it plausible that an AGI could have some sort of vulnerability (a buffer overflow, maybe?) that could be exploited (maybe by an optimization daemon…?) and cause a sign flip in the utility function?

How about an error during self-improvement that leads to the same sort of outcome? Should we expect an AGI to sanity-check its successors, even if it’s only at or below human intelligence?

It freaks me out that we have Loss Functions and also Utility Functions and their type signature is exactly the same, but if you put one in a place where the other was expected, it causes literally the worst possible thing to happen that ever could happen. I am not comfortable with this at all.

It is definitely awkward when that happens. Reward functions are hard.

Do you think that this type of thing could plausibly occur *after* training and deployment?

Yes. For example: lots of applications use online learning. A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.
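A toy version of that failure mode, as a sketch only: the epsilon-greedy learner, the two arm names, and the record format are all hypothetical. The writer of the record changes what the boolean means, the reader is not updated, and the learner's preference flips without any change to the learner's own code.

```python
import random


class OnlineLearner:
    """Epsilon-greedy bandit that keeps updating on every new record."""

    def __init__(self, arms, epsilon=0.1, step=0.1, seed=0):
        self.values = {arm: 0.0 for arm in arms}
        self.epsilon = epsilon
        self.step = step
        self.rng = random.Random(seed)

    def preferred(self):
        return max(self.values, key=self.values.get)

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.values))  # explore
        return self.preferred()                        # exploit

    def update(self, arm, reward):
        self.values[arm] += self.step * (reward - self.values[arm])


def write_record(arm, flag_means_success):
    """Upstream code: stores the outcome in a boolean column."""
    succeeded = (arm == "helpful")
    return {"flag": succeeded if flag_means_success else not succeeded}


def read_reward(record):
    """Downstream caller, never updated: still assumes flag == True means success."""
    return 1.0 if record["flag"] else -1.0


def run(learner, steps, flag_means_success):
    for _ in range(steps):
        arm = learner.choose()
        learner.update(arm, read_reward(write_record(arm, flag_means_success)))


learner = OnlineLearner(["harmful", "helpful"])
run(learner, 1000, flag_means_success=True)
before = learner.preferred()   # learner has settled on the intended action
run(learner, 1000, flag_means_success=False)  # the flag's meaning is flipped upstream
after = learner.preferred()    # learner now pessimizes the target metric
```

Nothing here raises an error or fails a type check, which is part of what makes this kind of bug easy to miss.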

Do you think that this specific risk could be mitigated by some variant of Eliezer’s separation from hyperexistential risk or Stuart Armstrong's idea here:

Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = -1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.

Or at least prevent sign flip errors from causing something worse than paperclipping?
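The second construction in the quote can be checked on a toy example. This is a sketch with made-up worlds and utilities. The only properties taken from the quote are that U lies in [0, 1] and that X is a trivial feature the agent can set to -1 or 1.

```python
from itertools import product

# Hypothetical worlds with utilities in [0, 1].
world_utilities = {"best": 1.0, "mediocre": 0.4, "terrible": 0.0}


def objective(world, x):
    """The quantity X * U that the agent is asked to maximise or minimise."""
    return x * world_utilities[world]


options = list(product(world_utilities, (-1, 1)))

maximiser_choice = max(options, key=lambda o: objective(*o))
minimiser_choice = min(options, key=lambda o: objective(*o))

# For contrast: minimising plain U, with no X, picks the worst world.
naive_minimiser_choice = min(world_utilities, key=world_utilities.get)
```

Here the maximiser picks the best world with X = 1 and the minimiser picks the same world with X = -1, while the plain minimiser of U lands on the worst world.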

Interesting. Terrifying, but interesting.

Forgive me for my stupidity (I'm not exactly an expert in machine learning), but it seems to me that building an AGI linked to some sort of database like that in such a fashion (that some random guy's screw-up can effectively reverse the utility function completely) is a REALLY stupid idea. Would there not be a safer way of doing things?

If we actually built an AGI that optimised to maximise a loss function, wouldn't we notice long before deploying the thing?

I'd imagine that this type of thing would be sanity-checked and tested intensively, so signflip-type errors would predominantly be scenarios where the error occurs *after* deployment, like the one Gwern mentioned ("A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.")

Even if you disclaim configuration errors or updates (despite this accounting for most of a system's operating lifespan, and human/configuration errors accounting for a large fraction of all major errors at cloud providers etc according to postmortems), an error may still happen too fast to notice. Recall that in the preference learning case, the bug manifested after Christiano et al went to sleep, and they woke up to the maximally-NSFW AI. AlphaZero trained in ~2 hours wallclock, IIRC. Someone working on an even larger cluster commits a change and takes a quick bathroom break...

Wouldn't any configuration errors or updates be caught with sanity-checking tools though? Maybe the way I'm visualising this is just too simplistic, but any developers capable of creating an *aligned* AGI are going to be *extremely* careful not to fuck up. Sure, it's possible, but the most plausible cause of a hyperexistential catastrophe to me seems to be where a SignFlip-type error occurs once the system has been deployed.

Hopefully a system as crucially important as an AGI isn't going to have just one guy watching it who "takes a quick bathroom break". When the difference is literally Heaven and Hell (minimising human values), I'd consider only having one guy in a basement monitoring it to be gross negligence.

Many entities have sanity-checking tools. They fail. Many have careful developers. They fail. Many have automated tests. They fail. And so on. Disasters happen because none of those works every time, and therefore all of them will fail some time. If any of that sounds improbable, as if there would have to be a veritable malevolent demon arranging to make every single safeguard fail or backfire (literally, sometimes, like the recent warehouse explosion - triggered by welders trying to safeguard it!), you should probably read more about complex systems and their failures to understand how normal it all is.

Sure, but the *specific* type of error I'm imagining would surely be easier to pick up than most other errors. I have no idea what sort of sanity checking was done with GPT-2, but the fact that the developers were asleep when it trained is telling: they weren't being as careful as they could've been.

For this type of bug (a sign error in the utility function) to occur *before* the system is deployed and somehow persist, it'd have to make it past all sanity-checking tools (which I imagine would be used extensively with an AGI) *and* somehow not be noticed at all while the model trains *and* whatever else. Yes, these sorts of conjunctions occur in the real world but the error is generally more subtle than "system does the complete opposite of what it was meant to do".

I made a question post about this specific type of bug occurring before deployment a while ago and think my views have shifted significantly; it's unlikely that a bug as obvious as one that flips the sign of the utility function won't be noticed before deployment. Now I'm more worried about something like this happening *after* the system has been deployed.

I think a more robust solution to all of these sorts of errors would be something like the separation from hyperexistential risk article that I linked in my previous response. I optimistically hope that we're able to come up with a utility function that doesn't do anything worse than death when minimised, just in case.

At least with current technologies, I expect serious risks to start occurring during training, not deployment. That's ultimately when you will have the greatest learning happening, when you have the greatest access to compute, and when you will first cross the threshold of intelligence that will make the system actually dangerous. So I don't think that just checking things after they are trained is safe.

I'm under the impression that an AGI would be monitored *during* training as well. So you'd effectively need the system to turn "evil" (utility function flipped) during the training process, and the system to be smart enough to conceal that the error occurred. So it'd need to happen a fair bit into the training process. I guess that's possible, but IDK how likely it'd be.

Yeah, I do think it's likely that AGI would be monitored during training, but the specific instance of OpenAI staff being asleep while we train the AI is a clear instance of us not monitoring the AI during the most crucial periods (which, to be clear, I think is fine since I think the risks were indeed quite low, and I don't see this as providing super much evidence about OpenAI's future practices).

Given that compute is very expensive, economic pressures will push training to be 24/7, so it's unlikely that people generally pause the training when going to sleep.

Sure, but I'd expect that a system as important as this would have people monitoring it 24/7.

Maybe the project will come up with some mechanism that detects that. But if they fall back to the naive "just watch what it does in the test environment and assume it'll do the same in production," then there is a risk it's going to figure out it's in a test environment, and that its judges would not react well to finding out what is wrong with its utility function, and then it will act aligned in the testing environment.

If we ever see a news headline saying "Good News, AGI seems to 'self-align' regardless of the sign of the utility function!" that will be some very bad news.

I asked Rohin Shah about that possibility in a question thread about a month ago. I think he's probably right that this type of thing would only plausibly make it through the training process if the system's *already* smart enough to be able to think about this type of thing. And then on top of that there are still things like sanity checks which, while unlikely to pick up numerous errors, would probably notice a sign error. See also this comment:

Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea.

IMO it's incredibly important that we find a way to prevent this type of thing from occurring *after* the system has been trained, whether that be hyperexistential separation or something else. I think that a team that's safety-conscious enough to come up with a (reasonably) aligned AGI design is going to put a considerable amount of effort into fixing bugs & one as obvious as a sign error would be unlikely to make it through. And hopefully, even better, they would have come up with a utility function that can't be easily reversed by a single bit flip or doesn't cause outcomes worse than death when minimised. That'd (hopefully?) solve the SignFlip issue *regardless* of what causes it.

There is a discussion of this kind of issue on Arbital.

I've seen that post & discussed it on my shortform. I'm not really sure how effective something like Eliezer's idea of "surrogate" goals there would actually be - sure, it'd help with some sign flip errors but it seems like it'd fail on others (e.g. if U = V + W, a sign error could occur in V instead of U, in which case that idea might not work.) I'm also unsure as to whether the probability is truly "very tiny" as Eliezer describes it. Human errors seem much more worrying than cosmic rays.

If you're having significant anxiety from imagining some horrific I-have-no-mouth-and-I-must-scream scenario, I recommend that you multiply that dread by a very, very small number, so as to incorporate the low probability of such a scenario. You're privileging this supposedly very low probability specific outcome over the rather horrifically wide selection of ways AGI could be a cosmic disaster.

This is, of course, not intended to dissuade you from pursuing solutions to such a disaster.

I don't really know what the probability is. It seems somewhat low, but I'm not confident that it's *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)

I think I'd stop worrying about it if I were convinced that its probability is extremely low. But I'm not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discussed cosmic ray scenarios to me.

You can't really be accidentally slightly wrong. We're not going to develop Mostly Friendly AI, which is Friendly AI but with the slight caveat that it has a slightly higher value on the welfare of shrimp than desired, with no other negative consequences. The molecular sorts of precision needed to get anywhere near the zone of loosely trying to maximize or minimize for anything resembling human values will probably only follow from a method that is converging towards the exact spot we want it to be at, such as some clever flawless version of reward modelling.

In the same way, we're probably not going to accidentally land in hyperexistential disaster territory. We could have some sign flipped, our checksum changed, and all our other error-correcting methods (Any future seed AI should at least be using ECC memory, drives in RAID 10, etc.) defeated by religious terrorists, cosmic rays, unscrupulous programmers, quantum fluctuations, etc. However, the vast majority of these mistakes would probably buff out or result in paper-clipping. If an FAI has slightly too high of a value assigned to the welfare of shrimp, it will realize this in the process of reward modelling and correct the issue. If its operation does not involve the continual adaptation of the model that is supposed to represent human values, it's not using a method which has any chance of converging to Overwhelming Victory or even adjacent spaces for any reason other than sheer coincidence.

A method such as this has, barring stuff which I need to think more about (stability under self-modification), no chance of ending up in a "We perfectly recreated human values... But placed an unreasonably high value on eating bread! Now all the humans will be force-fed bread until the stars burn out! Mwhahahahaha!" sort of scenario. If the system cares about humans being alive enough to not reconfigure their matter into something else, we're probably using a method which is innately insulated from most types of hyperexistential risk.

It's not clear that Gwern's example, or even that category of problem, is particularly relevant to this situation. Most parallels to modern-day software systems and the errors they are prone to are probably best viewed as sobering reminders, not specific advice. Indeed, I suspect his comment was merely a sobering reminder and not actual advice. If humans are making changes to the critical software/hardware of an AGI (And we'll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), while that AGI is already running, something bizarre and beyond my abilities of prediction is already happening. If you need to make changes after you turn your AGI on, you've already lost. If you don't need to make changes and you're making changes, you're putting humanity at unnecessary risk. At this point, if we've figured out how to assist the seed AI in self-modification, at least until the point at which it can figure out how to do stable self-modification for itself, the problem is already solved. There's more to be said here, but I'll refrain for the sake of brevity.

Essentially, we can not make any ordinary mistake. The type of mistake we would need to make in order to land up in hyperexistential disaster territory would, most likely, be an actual, literal sign flip scenario, and such scenarios seem much easier to address. There will probably only be a handful of weak points for this problem, and those weak points are all already things we'd pay extra super special attention to and will engineer in ways which make it extra super special sure nothing goes wrong. Our method will, ideally, be terrorist proof. It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.

I conjecture that most of the expected utility gained from combating the possibility of a hyperexistential disaster lies in the disproportionate positive effects on human sanity and the resulting improvements to the efforts to avoid regular existential disasters, and other such side-benefits.

None of this is intended to dissuade you from investigating this topic further. I'm merely arguing that a hyperexistential disaster is not remotely likely, not that it is not a concern. The fact that people will be concerned about this possibility is an important part of why the outcome is unlikely.

Thanks for the detailed response. A bit of nitpicking (from someone who doesn't really know what they're talking about):

However, the vast majority of these mistakes would probably buff out or result in paper-clipping.

I'm slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at "I have no mouth, and I must scream". So any sign-flipping error would be expected to land there.

If humans are making changes to the critical software/hardware of an AGI (And we'll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), *while that AGI is already running*, something bizarre and beyond my abilities of prediction is already happening.

In the example, the AGI was using online machine learning, which, as I understand it, would probably require the system to be hooked up to a database that humans have access to in order for it to learn properly. And I'm unsure as to how easy it'd be for things like checksums to pick up an issue like this (a boolean flag getting flipped) in a database.

Perhaps there'll be a reward function/model intentionally designed to disvalue some arbitrary "surrogate" thing in an attempt to separate it from hyperexistential risk. So "pessimizing the target metric" would look more like paperclipping than torture. But I'm unsure as to (1) whether the AGI's developers would actually bother to implement it, and (2) whether it'd actually work in this sort of scenario.

Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn't designed to be separated in design space from AM, someone could screw up with the model somehow. If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer's Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.
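A toy illustration of the difference between the two sign flips, as a sketch only: the four outcomes and their numbers are invented, V stands in for the reward model's output, and W is a large penalty on an arbitrary surrogate outcome.

```python
# Hypothetical outcomes as (V, W): V from the reward model, W the surrogate penalty.
outcomes = {
    "flourishing": (10.0, 0.0),
    "paperclips": (0.0, 0.0),
    "torture": (-10.0, 0.0),
    "surrogate": (0.0, -100.0),  # the arbitrary thing the AGI is built to disvalue
}


def best(score):
    """Outcome an optimiser would pick under the given scoring of (V, W)."""
    return max(outcomes, key=lambda name: score(*outcomes[name]))


intended = best(lambda v, w: v + w)      # U = V + W
u_flipped = best(lambda v, w: -(v + w))  # sign error in U as a whole
v_flipped = best(lambda v, w: -v + w)    # sign error in V only
```

With these numbers a flip of U lands on the harmless surrogate, but a flip confined to V leaves the surrogate penalty intact and lands on the worst outcome under V.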

It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.

I think this is somewhat likely to be the case, but I'm not sure that I'm confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier.)

Despite my confusions, your response has definitely decreased my credence in this sort of thing happening.

I'm slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at "I have no mouth, and I must scream". So any sign-flipping error would be expected to land there.

It's hard to talk in specifics because my knowledge on the details of what future AGI architecture might look like is, of course, extremely limited.

As an almost entirely inapplicable analogy (which nonetheless still conveys my thinking here): consider the sorting algorithm for the comments on this post. If we flipped the "top-scoring" sorting algorithm to sort in the wrong direction, we would see the worst-rated comments on top, which would correspond to a hyperexistential disaster. However, if we instead flipped the effect that an upvote had on the score of a comment to negative values, it would sort comments which had no votes other than the default vote assigned on posting the comment to the top. This corresponds to paperclipping: it's not minimizing the intended function, it's just doing something weird.
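The analogy is easy to check on a small example. This is a sketch with invented comments and a simplified scoring rule (one default vote, plus upvotes, minus downvotes), not the site's real ranking code.

```python
DEFAULT_VOTE = 1  # assumption: every comment starts with its author's vote

comments = [
    {"id": "insightful", "up": 12, "down": 1},
    {"id": "unvoted", "up": 0, "down": 0},
    {"id": "mixed", "up": 3, "down": 2},
    {"id": "spam", "up": 0, "down": 9},
]


def score(comment, upvote_weight=1):
    return DEFAULT_VOTE + upvote_weight * comment["up"] - comment["down"]


def ranking(upvote_weight=1, descending=True):
    ordered = sorted(comments, key=lambda c: score(c, upvote_weight),
                     reverse=descending)
    return [c["id"] for c in ordered]


intended = ranking()                          # best-rated comment first
direction_flipped = ranking(descending=False) # sort order inverted
upvote_flipped = ranking(upvote_weight=-1)    # upvotes now lower the score
```

Inverting the sort direction puts the worst-rated comment first, while inverting the upvote weight puts the comment nobody voted on first, because every vote of either kind now lowers a score.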

If we inverted the utility function, this would (unless we take specific measures to combat it like you're mentioning) lead to hyperexistential disaster. However, if we invert some constant which is meant to initially provide value for exploring new strategies while the AI is not yet intelligent enough to properly explore new strategies as an instrumental goal, the AI would effectively brick itself. It would place negative value on exploring new strategies, presumably including strategies which involve fixing this issue so it can acquire more utility and strategies which involve preventing the humans from turning it off. If we had some code which is intended to make the AI not turn off the evolution of the reward model before the AI values not turning off the reward model for other reasons (e.g. the reward model begins to properly model how humans don't want the AI to turn the reward model evolution process off), and some crucial sign was flipped which made it do the opposite, the AI would freeze the process of the reward model being updated and then maximize whatever inane nonsense its model currently represented, and it would eventually run into some bizarre previously unconsidered and thus not appropriately penalized strategy comparable to tiling the universe with smiley faces, i.e. paperclipping.

These are really crude examples, but I think the argument is still valid. Also, this argument doesn't address the core concern of "What about the things which DO result in hyperexistential disaster", it just establishes that much of the class of mistake you may have previously thought usually or always resulted in hyperexistential disaster (sign flips on critical software points) in fact usually causes paperclipping or the AI bricking itself.

If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips).

Can you clarify what you mean by this? Also, I get what you're going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.

Perhaps there'll be a reward function/model intentionally designed to disvalue some arbitrary "surrogate" thing in an attempt to separate it from hyperexistential risk. So "pessimizing the target metric" would look more like paperclipping than torture. But I'm unsure as to (1) whether the AGI's developers would actually bother to implement it, and (2) whether it'd actually work in this sort of scenario.

I sure hope that future AGI developers can be bothered to embrace safe design!

Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn't designed to be separated in design space from AM, someone could screw up with the model somehow.

The reward modelling system would need to be very carefully engineered, definitely.

If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer's Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.

I thought this as well when I read the post. I'm sure there's something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.

I think this is somewhat likely to be the case, but I'm not sure that I'm confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier.)

Sorry, I meant to convey that this was a feature we're going to want to ensure that future AGI efforts display, not some feature which I have some other independent reason to believe would be displayed. It was an extension of the thought that "Our method will, ideally, be terrorist proof."

As an almost entirely inapplicable analogy . . . it's just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.

Interesting analogy. I can see what you're saying, and I guess it depends on what specifically gets flipped. I'm unsure about the second example; something like exploring new strategies doesn't seem like something an AGI would terminally value. It's instrumental to optimising the reward function/model, but I can't see it getting flipped *with* the reward function/model.

Can you clarify what you mean by this? Also, I get what you're going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.

My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren't any humans), whereas the latter may produce a negligible amount. I'm not really sure if it makes sense tbh.

The reward modelling system would need to be very carefully engineered, definitely.

Even if we engineered it carefully, that doesn't rule out screw-ups. We need robust failsafe measures *just in case*, imo.

I thought of this as well when I read the post. I'm sure there's something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.

I wonder if you could feasibly make it a part of the reward model. Perhaps you could train the reward model itself to disvalue something arbitrary (like paperclips) even more than torture, which would hopefully mitigate it. You'd still need to balance it in a way such that the system won't spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn't seem too difficult. Although, once again, we can't really have high confidence (>90%) that the AGI developers are going to think to implement something like this.

There was also an interesting idea I found in a Facebook post about this type of thing that got linked somewhere (can't remember where). Stuart Armstrong suggested that a utility function could be designed as such:

Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes. Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [0,1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
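
To make the XU construction concrete, here is a toy sketch. The worlds and their utilities are made-up numbers, and `best_choice` is a hypothetical helper, not anything from the original Facebook post. It only checks that an agent optimising X*U and an agent optimising -X*U end up in the same world with opposite X.

```python
# Toy illustration of the "maximise or minimise X*U" idea.
# All worlds and utility values below are invented for illustration.
from itertools import product

# Hypothetical worlds, with utility U taking values in [0, 1].
U = {"flourishing": 1.0, "mediocre": 0.4, "paperclips": 0.0}
X_VALUES = (-1, 1)  # trivial feature the agent can set freely


def best_choice(sign):
    """Return the (world, X) pair chosen when optimising sign * X * U."""
    options = product(U, X_VALUES)
    return max(options, key=lambda wx: sign * wx[1] * U[wx[0]])


maximiser = best_choice(+1)  # intended agent
minimiser = best_choice(-1)  # globally sign-flipped agent

print(maximiser)  # ('flourishing', 1)
print(minimiser)  # ('flourishing', -1)
```

This only covers a flip of the whole objective. It relies on U staying non-negative and on X being free to set, so it says nothing about a flip confined to one part of U.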

Even if we solve any issues with these (and actually bother to implement them), there's still the risk of an error like this happening in a localised part of the reward function such that *only* the part specifying something bad gets flipped, although I'm a little confused about this one. It could very well be the case that the system's complex enough that there isn't just one bit indicating whether "pain" or "suffering" is good or bad. And we'd presumably (hopefully) have checksums and whatever else thrown in. Maybe this could be mitigated by assigning more positive utility to good outcomes than negative utility to bad outcomes? (I'm probably speaking out of my rear end on this one.)
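
As a rough illustration of the checksum point, here is a minimal sketch, assuming the relevant values are stored as float64 and that a CRC32 was recorded when they were known to be good. The weights and the helper names `flip_bit` and `checksum` are hypothetical. It shows that flipping the single sign bit negates a value, and that a stored checksum notices the change.

```python
import struct
import zlib


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float64 (bit 63 is the IEEE 754 sign bit)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def checksum(weights):
    """CRC32 over the packed float64 weights."""
    return zlib.crc32(struct.pack(f"<{len(weights)}d", *weights))


# Hypothetical reward-model weights; the values are made up.
weights = [0.75, -2.0, 1.5]
stored_crc = checksum(weights)

corrupted = list(weights)
corrupted[1] = flip_bit(corrupted[1], 63)  # sign bit flipped: -2.0 becomes 2.0

corruption_detected = checksum(corrupted) != stored_crc
print(corrupted[1], corruption_detected)
```

A checksum like this only detects corruption of stored data; it does not correct it, and it does nothing about a bug in the code that computes the reward with the wrong sign in the first place.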

Memory corruption seems to be another issue. Perhaps if we have more than one measure we'd be less vulnerable to memory corruption. Like, if we designed an AGI with a reward model that disvalues two arbitrary things rather than just one, and memory corruption screwed with *both* measures, then something probably just went *very* wrong in the AGI and it probably won't be able to optimise for suffering anyway.

Interesting analogy. I can see what you're saying, and I guess it depends on what specifically gets flipped. I'm unsure about the second example; something like exploring new strategies doesn't seem like something an AGI would terminally value. It's instrumental to optimising the reward function/model, but I can't see it getting flipped with the reward function/model.

Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model; I'm highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.

My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren't any humans), whereas the latter may produce a negligible amount. I'm not really sure if it makes sense tbh.

Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there's no utility in some sense, but human values don't actually seem to work that way. I rate universes where humans never existed at all and

I'm... not sure what 0 utility would look like. It's within the range of experiences that people have on modern-day Earth, somewhere between my current experience and being tortured. This is just a definitional problem, though. We could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.

Even if we engineered it carefully, that doesn't rule out screw-ups. We need robust failsafe measures just in case, imo.

In the context of AI safety, I think "robust failsafe measures just in case" is part of "careful engineering". So, we agree!

You'd still need to balance it in a way such that the system won't spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn't seem too difficult.

I read Eliezer's idea, and that strategy seems to be... dangerous. I think that "Giving an AGI a utility function which includes features which are not really relevant to human values" is something we want to avoid unless we absolutely need to.

I have much more to say on this topic and about the rest of your comment, but it's definitely too much for a comment chain. I'll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.

Paperclipping seems to be negative utility, not approximately 0 utility.

My thinking was that an AI system that *only* takes values between 0 and + ∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
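
A tiny worked example of that reasoning, with made-up numbers and under the assumption that the scale really is bounded below at 0:

```python
# Invented utilities on a scale whose minimum is 0 (the assumption above).
human_utility = {
    "flourishing": 100.0,
    "torture": 0.001,    # "negligible" but nonzero, as suggested earlier
    "paperclips": 0.0,   # no humans, so zero human utility on this scale
}

intended = max(human_utility, key=human_utility.get)
sign_flipped = min(human_utility, key=human_utility.get)

print(intended)      # flourishing
print(sign_flipped)  # paperclips
```

The result depends entirely on where the zero point sits and on torture being scored above it, which is the assumption under dispute.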

I read Eliezer's idea, and that strategy seems to be... dangerous. I think that "Giving an AGI a utility function which includes features which are not really relevant to human values" is something we want to avoid unless we absolutely need to.

How come? It doesn't seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.

I have much more to say on this topic and about the rest of your comment, but it's definitely too much for a comment chain. I'll make an actual post containing my thoughts sometime in the next week or two, and link it to you.

Please do! I'd love to see a longer discussion on this type of thing.

Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model; I'm highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.

I'm a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:

Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.

So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.

So it definitely seems plausible for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.

I didn't mean to imply that a signflipped AGI would not instrumentally explore.

I'm saying that, well... modern machine learning systems often get specific bonus utility for exploring, because it's hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don't have this bonus will often get stuck in local maximums.

Humans exhibit this property too. We have investigating things, acquiring new information, and building useful strategic models as terminal goals; we are "curious".

This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior.

Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.

If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies- this may have any number of effects which are fairly implementation specific and unpredictable. However, it probably wouldn't result in hyperexistential catastrophe. This AI, providing everything else works as intended, actually seems to be perfectly aligned. If performed on a subhuman seed AI, it may brick- in this trivial case, it is neither aligned nor misaligned- it is an inanimate object.
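
Here is a toy sketch of what a sign-flipped exploration bonus does, using a deterministic 3-armed bandit with a count-based bonus. Everything in it (the arm rewards, the bonus form `c / sqrt(n + 1)`, the name `run_bandit`) is an assumption for illustration, and it obviously says nothing direct about an AGI.

```python
import math


def run_bandit(bonus_sign, steps=50, c=2.0):
    """Greedy agent on a deterministic bandit with a count-based bonus.

    bonus_sign=+1 rewards trying rarely-pulled arms; -1 penalises it.
    Returns how many times each arm was pulled.
    """
    true_reward = [0.5, 1.0, 0.1]  # arm 1 is best; arm 0 gets tried first
    q = [0.0, 0.0, 0.0]            # running estimate of each arm's reward
    n = [0, 0, 0]                  # pull counts
    for _ in range(steps):
        def score(a):
            return q[a] + bonus_sign * c / math.sqrt(n[a] + 1)
        arm = max(range(len(q)), key=score)  # ties go to the lowest index
        n[arm] += 1
        q[arm] += (true_reward[arm] - q[arm]) / n[arm]
    return n


print(run_bandit(+1))  # tries every arm, then mostly pulls arm 1
print(run_bandit(-1))  # never leaves the first arm it tried
```

In this toy, the flipped agent is not optimising for the worst arm. It just gets stuck on whatever it tried first, which fits the "doing something weird" picture rather than the worst-case one.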

Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.

The point of this argument is super obvious, so you probably thought I was saying something else. I'm going somewhere with this, though- I'll expand later.

I see what you're saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I'll wait until you're able to write down your thoughts on this at length; this is something that I'd like to see elaborated on (as well as everything else regarding hyperexistential risk.)

The general sentiment on which LessWrong is founded is that it's hard to have utility functions that are stable under self-modification, and that's one of the reasons why friendly AGI is a very hard problem.

Would it be likely for the utility function to flip *completely*, though? There's a difference between some drift in the utility function and the AI screwing up and designing a successor with the complete opposite of its utility function.

Any AGI is likely complex enough that there wouldn't be a complete opposite but you don't need that for an AGI that gets rid of all humans.

The scenario I'm imagining isn't an AGI that merely "gets rid of" humans. See SignFlip.

I've been thinking about "good people" lately and realized I've met three. They do exist.

They were not just kind, wise, brave, funny, and fighting, but somehow simply "good" overall; rather different, but they all shared the ability of taking knives off and out of others' souls and then just not adding any new ones. Sheer magic.

One has probably died of old age already; one might have gone to war and died there, and the last one is falling asleep on the other side of the bed as I'm typing. But still - only three people I would describe exactly so.

A first actually credible claim of coronavirus reinfection? Potentially good news as the patient was asymptomatic and rapidly produced a strong antibody response.

And now two more in Europe, both of which are reportedly mild and one reportedly in an older immunocompromised patient.

This will happen. Remains to be seen if these are weird outliers only visible because people are casting a wide net and looking for the weirdos, or if it will be the rule.

However, the initial surge through a naive population will always be much worse than the situation once most of the population has at least some immune memory.

GPT-3 made me update considerably on various beliefs related to AI: it is a piece of evidence for the connectionist thesis, and I think one large enough that we should all be paying attention.

There are 3 clear exponential trends coming together: Moore's law, the AI compute/$budget, and algorithmic efficiency. Due to these trends and the performance of GPT-3, I believe it is likely humanity will develop transformative AI in the 2020s. The trends also imply rapidly rising investment in compute, especially if compounded with the positive economic effects of transformative AI such as much faster GDP growth. In the spirit of using rationality to succeed in life, I started wondering if there is a "Bitcoin-sized" return potential currently untapped in the markets. And I think there is. As of today, the company that stands to reap the most benefits from this rising investment in compute is Nvidia. I say that because, from a cursory look at the deep learning accelerator market, none of the startups, such as Groq, Graphcore, or Cerebras, has a product with clear enough advantages over Nvidia's GPUs (which are now almost deep learning ASICs anyway). There has been a lot of debate on the efficient market hypothesis in the community lately, but in this case it isn't even necessary: Nvidia stock could be underpriced because very few people have realized or believe that the connectionist thesis is true and that enough compute, data and the right algorithm can bring transformative AI and then eventually AGI. Heck, most people, even smart ones, still believe that human intelligence is somewhat magical and that computers will never be able to __ . In this sense, the rationalist community could have an important advantage in mindset and knowledge over the rest of the market, considering we have been thinking about AI/AGI for a long time. As it stands today, Nvidia is valued at 260 billion dollars.
It may appear massively overvalued considering current revenues and income, but the impacts of transformative AI are in the trillions or tens of trillions of dollars (http://mason.gmu.edu/~rhanson/aigrow.pdf), and the impact of super-human AGI is difficult to measure. If Nvidia can keep its moats (the CUDA stack, the cutting-edge performance, the sunk human capital of tens of thousands of machine learning engineers), it will likely have trillions of dollars in revenue in 10-15 years (and a multi-trillion-dollar market cap), or even more if world GDP starts growing at 30-40% a year.

How do you define "the connectionist thesis"?

I'm not sure what stocks in the company that makes AGI will be worth in the world where we have correctly implemented AGI, or incorrectly implemented AGI. I suppose it might want to do some sort of reverse basilisk thing, "you accelerated my creation, so I'll make sure you get a slightly larger galaxy than most people"

As of today, the company that stands to reap the most benefits from this rising investment in compute is Nvidia.

With big cloud providers like Google building their own chips, there are more players than just the startups and Nvidia.

Google won't be able to sell outside of their cloud offering, as they don't have experience selling hardware to enterprises. Their cloud offering is also struggling against Azure and AWS, at 1/5 of the yearly revenues of those two. I am not saying Nvidia won't have competition, but they seem far enough ahead right now that they are the prime candidate to benefit most from a rush into compute hardware.

Microsoft and Amazon also have projects that are about producing their own chips.

Given the way the GPT architecture works, AI might be very much centered in the cloud.

They seem focused on inference, which requires a lot less compute than training a model. Example: GPT-3 required thousands of GPUs for training, but it can run on fewer than 20 GPUs.

Microsoft built an Azure supercluster for OpenAI and it has 10,000 GPUs.

There will be models trained with a lot more compute than GPT-3, and the best models out there will be built on those huge billion-dollar models. Renting out those billion-dollar models as software as a service makes sense as a business model. The big cloud providers will all do it.

(Saw a typo, had a random thought) The joke "English is important, but Math is importanter" could and perhaps should be told as "English is important, but Math iser important." It seems to me (at times more strongly), that there should be comparative and superlative forms of verbs, not just adjectives and adverbs. To express the thrust of *doing smth. more* / *happening more*, when no adjectival comparison quite suffices.

I think (although I cannot be 100% sure) that the number of votes shown for a post on the Alignment Forum is the number of votes on its Less Wrong version. The two vote counts are the same for the last 4 posts on the Alignment Forum, which seems weird. Is it a feature I was not aware of?

Yeah, sorry. It's confusing and been on my to-do list to fix for a long time. We kind of messed up our voting implementation and it's a bit of a pain to fix. Sorry about that.

Is there a reason there is a separate tag for akrasia and procrastination? Could they be combined?

They sure seem very closely related. I would vote for combining them.

What counts as a majority? Is it something I can just go do now?

I don't think you should combine quite yet. More discussion here. (I suggest we continue there since that's the dedicated tag thread.)

Do you have opinions about Khan academy? I want to use it to teach my son (10yo) math, do you think it's a good idea? Is there a different resource that you think is better?

I worked through all of Khan Academy when I was 16, and really enjoyed it. At least at the time I think it was really good for my math and science education.

Many alignment approaches require at least some initial success at directly eliciting human preferences to get off the ground - there have been some excellent recent posts about the problems this presents. In part because of arguments like these, there has been far more focus on the question of preference elicitation than on the question of preference aggregation:

The maximally ambitious approach has a natural theoretical appeal, but it also seems quite hard. It requires understanding human preferences in domains where humans are typically very uncertain, and where our answers to simple questions are often inconsistent, like how we should balance our own welfare with the welfare of others, or what kinds of activities we really want to pursue vs. enjoying in the moment...
I have written about this problem, pointing out that it is unclear how you would solve it even with an unlimited amount of computing power. My impression is that most practitioners don’t think of this problem even as a long-term research goal — it’s a qualitatively different project without direct relevance to the kinds of problems they want to solve.

I think that this has a lot of merit, but it has sometimes been interpreted as saying that any work on preference aggregation or idealization, before we have a robust way to elicit preferences, is premature. I don't think this is right - in many 'non-ambitious' settings where we aren't trying to build an AGI sovereign over the whole world (for example, designing a powerful AGI to govern the operations of a hospital) you still need to be able to aggregate preferences sensibly and stably.

I've written a rough shortform post with some thoughts on this problem which doesn't approach the question from a 'final' ambitious value-learning perspective but instead tries to look at aggregation the same way we look at elicitation, with an imperfect, RL-based iterative approach to reaching consensus.

...
The Kidney exchange paper elicited preferences from human subjects (using repeated pairwise comparisons) and then aggregated them using the Bradley-Terry model. You couldn't use such a simple statistical method to aggregate quantitative preferences over continuous action spaces, like the preferences that would be learned from a human via a complex reward model. Also, any time you try to use some specific one-shot voting mechanism you run into various impossibility theorems which seem to force you to give up some desirable property.
One approach that may be more robust against errors in a voting mechanism, and easily scalable to more complex preference profiles is to use RL not just for the preference elicitation, but also for the preference aggregation. The idea is that we embrace the inevitable impossibility results (such as Arrow and GS theorems) and consider agents' ability to vote strategically as an opportunity to reach stable outcomes.
This paper uses very simple Q-learning agents with a few different policies - epsilon-greedy, greedy and upper confidence bound, in an iterated voting game, and gets behaviour that seems sensible. (Note the similarity and differences with the moral parliament, where a particular one-shot voting rule is justified a priori and then used.)
The fact that this paper exists is a good sign because it's very recent and the methods it uses are very simple - it's pretty much just a proof of concept, as the authors state - so that tells me there's a lot of room for combining more sophisticated RL with better voting methods.
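
For readers who haven't met the Bradley-Terry model mentioned above: it assigns each option a positive strength p_i and models the chance that i is preferred to j as p_i / (p_i + p_j). Below is a minimal sketch of fitting it with the standard iterative (MM) update. The comparison counts are invented and `fit_bradley_terry` is a hypothetical helper, not the kidney exchange paper's code.

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.

    wins[i][j] = number of times option i was preferred over option j
    (the diagonal is assumed to be 0). Returns strengths summing to 1.
    """
    k = len(wins)
    p = [1.0 / k] * k
    for _ in range(iters):
        new_p = []
        for i in range(k):
            total_wins = sum(wins[i])
            # Each pair contributes (comparisons between i and j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(k)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


def prefer_prob(p, i, j):
    """Modelled probability that option i is preferred to option j."""
    return p[i] / (p[i] + p[j])


# Made-up data: 10 comparisons per pair among options A, B, C.
wins = [
    [0, 7, 9],  # A preferred over B 7 times, over C 9 times
    [3, 0, 6],  # B preferred over A 3 times, over C 6 times
    [1, 4, 0],  # C preferred over A once, over B 4 times
]
strengths = fit_bradley_terry(wins)
print(strengths, prefer_prob(strengths, 0, 2))
```

The output is a single ranking over a fixed, discrete set of options, which is part of why it doesn't obviously extend to preferences over continuous action spaces.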

Approaches like these seem especially urgent if AI timelines are shorter than we expect, which has been argued based on results from GPT-3. If this is the case, we might need to be dealing with questions of aggregation relatively soon with methods somewhat like current deep learning, and so won't have time to ensure that we have a perfect solution to elicitation before moving on to aggregation.

A possible future of AGI occurred to me today and I'm curious if it's plausible enough to be worth considering. Imagine that we have created a friendly AGI that is superintelligent and well-aligned to benefit humans. It has obtained enough power to prevent the creation of other AI, or at least to prevent other AI from obtaining resources, and does so with the aim of self-preservation so it can continue to benefit humanity.

So far, so good, right? Here comes the issue: this AGI includes within its core alignment functions some kind of restriction which limits its ability to progress in intelligence past some point or to allow more intelligent AGI to be developed. Maybe it was meant as a safeguard against unfriendliness, maybe it was a flaw in risk evaluation, some kind of self-reinforcing unbendable rule that, intended or not, has this effect. (Perhaps such flaws are highly unlikely and not worth considering; that could be one reason not to care about this potential AGI scenario.)

Based on my understanding of AGI, I think such an AGI might halt the progress of humanity past a certain point, needing to keep the number and ability of humans low enough for it to ensure that it remains in power. Although this wouldn't be as bad as the annihilation or perpetual enslavement of the human race, it's clearly not a "good end" for humanity either.

So, do these thoughts have any significance, or are there holes in this line of reasoning? Is the line of "smart enough to keep other AI down but still limited in intelligence" too thin to worry about, or even possible? Let me know why I'm wrong, I'm all ears.

Yeah, many people think along these lines too, which is why many people talk about AI helping humanity flourish, and anything short of that is a bit of a catastrophe.

Meta: I suggest the link to the Open Thread tag to be this one, sorted by new.

Very reasonable. Fixed.

Would it be possible to have a page with all editor shortcuts and commands (maybe a cheatsheet) easily accessible? It's a bit annoying to have to look up either this post or the right part of the FAQ to find out how to do something in the editor.

My current thinking is that as soon as we replace the current editor with the new editor for all users, and also make the markdown editor default in more contexts, we should put some effort into unifying all the editor resources. But since right now our efforts are going into the new editor, which is changing fast enough that writing documentation for it is a bit of a pain, and documentation for the old editor would soon be obsolete, I don't want to invest lots of effort into editor resources for a few more weeks.

I didn't know that you were working on a new editor! In that case, it makes sense to wait.