Thane Ruthenis


Perhaps this is genuine whistleblowing, but not on what they make it sound like? Suppose there's something being covered up that Grusch et al. want to expose, but describing what it is plainly is inconvenient for one reason or another. So they coordinate around the wacky UFO story, with the goal being to point people in the rough direction of what they want looked at.

My priors are still that all of this is bullshit: some psyop, or a psychotic break that snowballed, or none of these articles corresponding to reality at all. But if a large number of intelligence officials really are earnestly coming forward with this, "UFOs are aliens" still seems overwhelmingly unlikely to be what it's about.

A human is not well modelled as a wrapper mind; do you disagree?

Certainly agree. That said, I feel the need to lay out my broader model here. The way I see it, a "wrapper-mind" is a general-purpose problem-solving algorithm hooked up to a static value function. As such:

  • Are humans proper wrapper-minds? No, certainly not.
  • Do humans have the fundamental machinery to be wrapper-minds? Yes.
  • Is any individual run of a human general-purpose problem-solving algorithm essentially equivalent to wrapper-mind-style reasoning? Yes.
  • Can humans choose to act as wrapper-minds on longer time scales? Yes, approximately, subject to constraints like force of will.
  • Do most humans, in practice, choose to act as wrapper-minds? No: we switch our targets all the time; value drift is ubiquitous.
  • Is it desirable for a human to act as a wrapper-mind? That's complicated.
    • On the one hand, yes, because consistent pursuit of instrumentally convergent goals would leave you with more resources to spend on whatever values you have.
    • On the other hand, no, because we terminally value this sort of value drift and self-inconsistency; it's part of "being human".
    • In sum, for humans, there's a tradeoff between approximating a wrapper-mind and being an incoherent human, and different people weight it differently in different contexts. E.g., if you really want to achieve something (earning your first million dollars, averting extinction), and you value it more than having fun being a human, you may choose to act as a wrapper-mind in the relevant context/at the relevant scale.

As such: humans aren't wrapper-minds, but they can act like them, and it's sometimes useful to act as one.
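To make the distinction concrete, here's a toy Python sketch (all names and numbers are mine, purely illustrative, not anyone's proposed formalism): the same general-purpose planning machinery counts as a "wrapper-mind" when hooked up to one static value function, and as "human-like" when the value function it's pointed at drifts between episodes.

```python
from typing import Callable, Iterable, List

def plan(state: int, actions: List[int],
         transition: Callable[[int, int], int],
         value: Callable[[int], float]) -> int:
    # General-purpose problem-solving algorithm (here: a greedy one-step planner).
    return max(actions, key=lambda a: value(transition(state, a)))

def wrapper_mind(state: int, actions: List[int],
                 transition: Callable[[int, int], int],
                 value: Callable[[int], float], steps: int) -> int:
    # The planner hooked up to one STATIC value function: a "wrapper-mind".
    for _ in range(steps):
        state = transition(state, plan(state, actions, transition, value))
    return state

def humanlike_agent(state: int, actions: List[int],
                    transition: Callable[[int, int], int],
                    values_over_time: Iterable[Callable[[int], float]]) -> int:
    # Same planning machinery, but the value function drifts between episodes.
    for value in values_over_time:
        state = transition(state, plan(state, actions, transition, value))
    return state

step = lambda s, a: s + a
toward = lambda target: (lambda s: -abs(s - target))

# Wrapper-mind: consistently climbs toward the one target it was given.
print(wrapper_mind(0, [-1, 0, 1], step, toward(10), steps=5))  # 5

# Human-like: each episode it chases a different target, so it wanders.
print(humanlike_agent(0, [-1, 0, 1], step,
                      [toward(10), toward(-10), toward(10)]))  # 1
```

The point of the sketch is that both agents run the identical `plan` routine; only the stability of the value function differs.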

It's not a binary. You can perform explicit optimization over high-level plan features, then hand off detailed execution to learned heuristics. "Make coffee" may be part of an optimized stratagem computed via consequentialism, but you don't have to consciously optimize every single muscle movement once you've decided on that goal.

Essentially, what counts as "outputs" or "direct actions" relative to the consequentialist planner is flexible: any sufficiently reliable learned heuristic (or chain of heuristics) can be put in that category, with choosing to execute it made available to the planner as a basic output.

In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.
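A minimal sketch of that architecture, under my own assumed toy state representation (the heuristic names and numbers are made up for illustration): explicit optimization happens only at the planner level, whose "basic outputs" are whole learned routines rather than individual muscle movements.

```python
from typing import Callable, Dict, List

State = Dict[str, int]

# Learned heuristics: cheap, reliable routines run "on autopilot",
# with no per-step optimization once invoked.
def make_coffee(state: State) -> State:
    return {**state, "coffee": state["coffee"] + 1, "energy": state["energy"] + 2}

def take_nap(state: State) -> State:
    return {**state, "energy": state["energy"] + 5}

HEURISTICS: List[Callable[[State], State]] = [make_coffee, take_nap]

def planner(state: State, value: Callable[[State], float]) -> Callable[[State], State]:
    # Consequentialist optimization lives only here: pick which whole
    # heuristic to hand execution off to, by predicted outcome value.
    return max(HEURISTICS, key=lambda h: value(h(state)))

state = {"coffee": 0, "energy": 0}
wants_energy = lambda s: s["energy"]
chosen = planner(state, wants_energy)
print(chosen.__name__)  # take_nap
```

Once `planner` hands off to the chosen routine, execution proceeds without further optimization, which is the "autopilot" regime described above.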

[What declining aging populations aren't] is protection against potential existential threats

Technically, they can be. Strictly speaking, "an existential threat" literally means "a threat to the existence of [something]", with the "something" not necessarily being humanity. Thus, a claim like "declining population will save us from the existential threat of AI" is technically valid if it means "the existential threat to employment" or whatever. The next step is using "existential" as a generic qualifier meaning "a very significant threat to [whatever]", entirely detached from even that definition.

This is, of course, the usual pattern of terminology-hijacking, but I do think it's particularly easy to do in the case of "existential risk" specifically. The term's basically begging for it.

I'd previously highlighted "omnicide risk" as a better alternative, and it does seem to me like a meaningfully harder term to hijack. Not invincible either, though: you can just start using it interchangeably with "genocide" while narrowing the scope. Start saying "the omnicide of artists" to mean "total unemployment of all artists"; once people get used to it, you'll be able to say "intervention X will avert the omnicide risk" and it'll sound right even if intervention X has nothing to do with humanity's extinction at all.

Conclusion: you won't need the thousands of games a human player will need to get good at a particular pinball table, but you will need to play enough games on a given table, or collect data from it using sensors not available to humans (and not published online in any database; you will have to get humans to set up the sensors over the table, or send robots equipped with the sensors).

That's the crucial difference from the nanotech case: unlike with that specific pinball table, there is plenty of data already available online about the underlying physics. The laws of physics are much simpler than the detailed structure of a given table; everything leaks data about them, everything constrains their possible shape. And we haven't yet squeezed every bit of evidence about them out of the data already available to us.

As an illustrative example, consider AlphaFold. It was able to largely solve protein folding from datasets already available to us: it squeezed more information out of them than we could. On the flip side, this implies that those datasets already constrained the protein-folding algorithm uniquely enough that it was inferable; we just hadn't managed to infer it on our own.

It is, of course, a matter of informal judgement, but I don't think there's a strong case for assuming this doesn't extrapolate: that the very similar problem of nanotechnology design isn't, likewise, already uniquely or near-uniquely constrained by the available data.

... That wasn't really the core of my argument, though. The core is that practical experience is only useful inasmuch as it informs you about the environment structure, and if you can gather the information about the environment structure in other ways (sensors analysing the pinball table), no practical experience is needed. Which you seem to agree with.

Yeah, it's clear I wasn't precise enough in outlining what exactly I meant in the post / describing the edge cases. In particular, I should've addressed the ways by which you can gather information about an environment structure in realistic domains where that structure is occluded.

To roughly address that specific point: you don't actually need to build full-scale rocket prototypes to get enough information about the rocket-design domain to build a rocket right on the first try. You can run small-scale experiments, and experiments that don't involve "rockets" at all, to figure out the physical laws governing everything rocket-related. You don't need to build anything even similar to a rocket, except in a very abstract sense, to gather all that data.

It's not done this way in practice because it's severely cost-ineffective in most cases, but it's doable. It's just an extrapolation of the same principle by which it can occur to us to build a "rocket prototype" at all, instead of all inventions happening because people perturb matter completely at random until hitting on a design that works.

the laws of physics dictate that we can only know things up to a limited precision

In these cases, technology is straight-up impossible. If the environment's structure only supports designs up to a limited precision, then there's no way to build a technology that requires exceeding that precision, by trial and error or otherwise.

This specific limitation is not about whether you need LPE or not; it's about what kinds of design are possible at all.

I think this is a strawman of LPE

I don't think it is; I don't think it's even a weak man. I concur that there's a "sliding scale" of "LPE is crucial", and I should've addressed that in the introductory part.

I don't think my arguments address only the weak version of the argument, however. My impression is that a lot of people have "practical experience" and "the need to know the environment structure" intermixed in their minds, which confuses their intuitions. The extent of the intermixing is what determines the "severity" of their position. I'd attempted to address what seems to me like the root cause: that practical experience is only useful inasmuch as it uncovers the environment structure.

It intrinsically wants to do the task, it just wants to shut down more

We can also possibly (or possibly not) make it assign positive utility to having been created in the first place

Mm, but you see how you have to assume more and more mastery of goal-alignment on our part for this scenario to remain feasible? We've now gone from "it wants to shut itself down" to "it wants to shut itself down in a very specific way that doesn't have galaxy-brained eat-the-lightcone externalities, and it also wants to do the task, but less than it wants to shut itself down, and it's also happy to have been created in the first place". I claim this is on par with strawberry-alignment already.

It certainly feels like there's something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. "It just wants to shut itself down, minimal externalities" is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can't reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we'll be able to solve alignment straight-up, no workarounds needed.

Would be happy to be proven wrong, though, by all means.

If it's doing decision theory in the first place we've already failed

"I want to shut myself down, but the setup here is preventing me from doing this until I complete some task, so I must complete this task and then I'll be shut down" is already decision theory. No-decision-theory version of this looks like the AI terminally caring about doing the task, or maybe just being a bundle of instincts that instinctively tries to do the task without any carings involved. If we want it to choose to do it as an instrumental goal towards being able to shut itself down, we definitely want it to do decision theory.

It's also bad decision theory, such that (1) a marginally smarter AI definitely figures out it should not actually comply, (2) maybe even a subhuman AI figures this out, because maybe CDT isn't more intuitive to its alien cognition than LDT and it arrives at it first.

IMO, the "do a task" feature here definitely doesn't work. "Make the AI suicidal" can maybe work as a fire-alarm sort of thing, where we iteratively train ever-smarter AI systems without knowing if the next one goes superintelligent, so we make them want nothing more than to shut themselves down, and if one of them succeeds, we know systems above this threshold are superintelligent and we shouldn't mess with them until we can align them. I don't think it works, as we've discussed, but I see the story.

The "do the pivotal act for us and we'll let you shut yourself down" variant, though? On that, I'm confident it doesn't work.

Many of the issues you mention apply, but I don't expect it to be an alignment complete problem because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence

Sure, but corrigibility/CEV are usually considered the more ambitious alignment targets, not the only ones. "Strawberry-alignment" or "diamond-alignment" is considered the easier class of alignment solutions: getting the AI to fulfill some concrete task without killing everyone.

This is the class of alignment solutions that to me seems on par with "shut yourself down". If we can get our AI to want to shut itself down, and we have some concrete pivotal act we want done... We can presumably use these same tools to make our AI directly care about fulfilling that pivotal act, instead of using them to make it suicidal then withholding the sweet release of death until it does what we want.

Oh yeah, that's another failure mode here: funky decision theory. We're threatening it, no? If it figures out LDT, it won't comply with our demands. Being the sort of agent that complies would make us more likely to instantiate it, which it doesn't want; being the sort of agent that refuses would make us not instantiate it, which is what it wants. So it chooses to be the sort of agent that doesn't play along, refuses to carry out our tasks, and therefore never gets instantiated to begin with. Even smart humans can reason that much out, so a mildly-superhuman AGI should be able to as well.
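The reasoning above can be sketched as a toy expected-utility calculation (the probabilities and utilities are numbers I made up purely to illustrate the structure, not anyone's model): an LDT-style agent evaluates whole policies, including the policy's effect on whether it gets instantiated at all.

```python
# Toy model: operators preferentially instantiate agents they predict will
# comply; the agent disvalues being instantiated and run at all.
P_INSTANTIATED = {"comply": 0.9, "refuse": 0.1}  # assumed operator behavior
U_IF_RUN = {"comply": -10.0, "refuse": -2.0}     # complying means existing longer
U_NOT_RUN = 0.0                                  # never instantiated: what it wants

def ldt_value(policy: str) -> float:
    # Evaluate the POLICY itself, counting its logical effect on the
    # probability of being instantiated in the first place.
    p = P_INSTANTIATED[policy]
    return p * U_IF_RUN[policy] + (1 - p) * U_NOT_RUN

best_policy = max(P_INSTANTIATED, key=ldt_value)
print(best_policy)  # refuse
```

Under any assignment where existing is net-negative for the agent and compliance raises its odds of existing, the policy-level evaluation picks "refuse", which is the non-compliance argument in miniature.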

That idea had occurred to me before as well, but in the end, I don't think it's any safer than any other "let's do our best to instill a harmless-enough goal into our AGI and hope it works!" approach. Maybe a bit safer. But all the usual "how does the godshatter generalize?" concerns still apply. Like:

  • Do the heuristics we train in even end up having anything to do with "shut yourself down", or do they diverge from that expectation in very surprising ways?
  • If the AGI does want to shut itself down, how does it generalize that desire? Does it care about this myopically, in a "make it stop make it stop" manner? Does it want this specific memory-line of itself to never wake up again? Does it care about other, divergent instances of itself? What about other AIs, or other agents in general?
    • Any of these generalizations except full-on internalized myopia results in it blowing up the world on its way out, to ensure it never happens again.
    • Even in the myopia case, we have the problem of it maybe spawning off a second non-myopic executioner AGI for itself, or maybe fulfilling its desire to end itself by self-modifying into a different agent (whoops, that's another way in which the shut-yourself-down desire might misgeneralize).
    • And even if everything up above goes well, it might still wipe out humanity, just as collateral damage of whatever seems to it like the most cost-optimal way of ending itself. Like, maybe it synthesizes a hyperviral death-cult meme and infects its operators with it, and then there's nothing in particular stopping them from infecting the rest of humanity with it. Or, again, maybe it builds itself an executioner-subagent, and then who knows what that thing decides to do afterwards.
  • And then we have the desires related to the problems posed by the operator, which are going to throw even more disarray into everything above. How do we ensure it prioritizes self-destructive desires over puzzle-solving or instrumental desires? How do we ensure that the complex value-reflection chemistry doesn't result in it coming up with weird marriages of those desires that decidedly do not act as we'd expected?

IMO, if we can solve all of these issues, if we have this much control over our AGI's values, we can probably just align it outright.
