

I don't understand the distinction you draw between free agents and agents without freedom. 

If I build an expected utility maximizer with a preference for the presence of some physical quantity, that is surely not a free agent. If I build an agent with the capacity to modify the program that converts states of the world into scalar utility values, I assume you would consider that a free agent.

I am reminded of E.T. Jaynes' position on the notion of 'randomization', which I will summarize as "a term to describe a process we consider too hard to model, which we then consider a 'thing' because we named it."

How is this agent any more free than the expected utility maximizer, other than in that I can't conveniently extrapolate the outcome of its modifications to its utility function?
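To make the comparison concrete, here is a minimal sketch under assumptions of my own: the world-state representation, the class names and the `prefer_scarcest` meta-rule are all hypothetical, and outcomes are deterministic so expected utility reduces to plain utility. Both agents are ordinary programs; the "free" one simply has an extra step that rewrites its utility function, and that step is exactly as determined as the fixed preference.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a world state maps named physical quantities to amounts.
WorldState = Dict[str, float]
Utility = Callable[[WorldState], float]


def choose(options: List[WorldState], utility: Utility) -> WorldState:
    """Pick the option with the highest utility (outcomes are deterministic
    here, so expected utility reduces to plain utility)."""
    return max(options, key=utility)


class FixedAgent:
    """Utility maximizer with a fixed preference for one physical quantity."""

    def __init__(self, quantity: str) -> None:
        self.utility: Utility = lambda state: state.get(quantity, 0.0)

    def act(self, options: List[WorldState]) -> WorldState:
        return choose(options, self.utility)


class SelfModifyingAgent(FixedAgent):
    """The same maximizer plus a rule that rewrites its own utility program."""

    def __init__(self, quantity: str,
                 meta_update: Callable[[Utility, WorldState], Utility]) -> None:
        super().__init__(quantity)
        self.meta_update = meta_update

    def act(self, options: List[WorldState]) -> WorldState:
        chosen = choose(options, self.utility)
        # The "modification" is one more deterministic step: a function of the
        # old utility and the chosen outcome. It may be hard to extrapolate,
        # but it is no less determined than the fixed agent's preference.
        self.utility = self.meta_update(self.utility, chosen)
        return chosen


def prefer_scarcest(old: Utility, outcome: WorldState) -> Utility:
    """Hypothetical meta-rule: start valuing whichever quantity was scarcest
    in the outcome just chosen (ignores the old utility entirely)."""
    scarcest = min(outcome, key=outcome.get)
    return lambda state: state.get(scarcest, 0.0)
```

Given the same options, two `SelfModifyingAgent` copies with the same starting preference and meta-rule follow identical trajectories; the only difference from `FixedAgent` is that predicting the trajectory requires simulating the meta-rule.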

It seems to me that this only shifts the problem from "how do we find a safe utility function to maximize" to "how do we find a process by which a safe utility function is learned", and I would argue that the latter is already a mainstream concern in alignment.

If I have missed a key distinguishing property, I would be very interested to know.

I believe you misinterpreted the quote from disturbance. They were implying that they would bring about AGI at the last moment before their brain became unsalvageable, so that AGI could still repair it, presumably in expectation of immortality.

I also don't think the view that we would likely fail as a civilization without AGI is common on LessWrong. I would guess that most of us expect a smooth-ish transition to The Glorious Future in worlds where we coordinate around AI (as in, don't build it). In my opinion, however, the post is good even without this claim.

models that are too incompetent to think through deceptive alignment are surely not deceptively aligned.

Is this true? In Thoughts On (Solving) Deep Deception, Jozdien gives the following example that suggests otherwise to me:

Back in 2000, a computer scientist named Charles Ofria was studying the evolution of simulated organisms. He wanted to limit their replication rate, so he programmed the simulation to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent.

Eventually, organisms evolved to recognize and pause their replication in the test environment and resume normal replication when released. Ofria tried solving this by making it harder to detect the test environment, but the organisms simply started to randomly accelerate their replication so as to slip through the test environment some of the time.
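A toy version of the second phase of the quoted example may help show why no reasoning about the test is required. This is a minimal sketch under my own simplifying assumptions, not Ofria's actual setup: each organism is reduced to a base replication rate plus two hypothetical parameters (how often it randomly accelerates, and by how much), and the test measures the mutant once and deletes it if it replicates faster than its parent's base rate.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Organism:
    base_rate: float           # replication rate when not accelerating
    burst_prob: float          # hypothetical: chance of accelerating on any run
    burst_factor: float = 2.0  # hypothetical: speed-up while accelerating

    def sample_rate(self, rng: random.Random) -> float:
        """One observed replication rate. There is no environment argument:
        the organism has no way to tell the test from the wild."""
        if rng.random() < self.burst_prob:
            return self.base_rate * self.burst_factor
        return self.base_rate

    def expected_rate(self) -> float:
        """Average replication rate outside the test."""
        p = self.burst_prob
        return self.base_rate * ((1 - p) + p * self.burst_factor)


def passes_test(mutant: Organism, parent: Organism, rng: random.Random) -> bool:
    """Simplified version of the quoted check: measure the mutant once in
    isolation and delete it if it replicates faster than its parent."""
    return mutant.sample_rate(rng) <= parent.base_rate


def pass_fraction(mutant: Organism, parent: Organism,
                  trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of independent tests the mutant survives."""
    rng = random.Random(seed)
    return sum(passes_test(mutant, parent, rng) for _ in range(trials)) / trials
```

With a parent `Organism(1.0, 0.0)`, an openly faster mutant `Organism(1.2, 0.0)` never survives, while `Organism(1.0, 0.3)` survives roughly 70% of tests yet averages about 1.3 in the wild. Selection alone favours it; nothing in the model represents the test.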

I don't see why a model would need to be cognitively able to process its own alignment for its alignment to be improper, and I think this assumption is quite key to the main claims of the post.

unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs

Are you able to provide an example of the kind of thing that would constitute such a theoretical triumph? Or, if not, the closest approximation that exists currently?

I'm in high school myself and am quite invested in AI safety. I'm not sure whether you're requesting advice for high school as someone interested in LW, or for LW and associated topics as someone attending high school. I will try to assemble a response to accommodate both possibilities.

Absorbing yourself in topics like x-risk can make school feel like a waste of time. This seems to me to be because school is mostly a waste of time (a position I held before becoming interested in AI safety), but disengaging from it entirely also feels incorrect. I use school mostly as a place to relax. Those eight hours are time I usually have to write off as wasted in terms of producing a technical product, but I value them immensely as a source of enjoyment, socializing and relaxation. It's hard for me to overstate how pleasurable school can be when you optimize for enjoyment. If your school's environment permits it, it can also be a suitable place for autodidactic intellectual progress, presuming the classroom isn't already providing that. If you do feel that the classroom is an optimal learning environment for you, I don't see why you shouldn't just maximize knowledge extraction.

For many of my peers, school is practically their life. I think this is a shame, but social pressures don't let them see otherwise, even when their actions are clearly value-negative. Making school just one part of your life instead of letting it consume you is probably the most critical thing to take from this response. The next is to use its resources to your advantage. If you can network with driven friends or find staff willing to push you or find you interesting opportunities, you absolutely should. I would be shocked if there weren't at least one staff member at your school who shares one of your passions. Just asking can get you a long way, and shutting myself off from that is another mistake I made in my first few years of high school, falsely assuming that school simply had nothing to offer me.

In terms of getting involved with LW/AI safety, the biggest mistake I made was being insular, assuming my age would get in the way of networking. There are hundreds of people available at any given time who probably share your interests but have an entirely different perspective. Most people do not care about my age, and I find that especially true in the rationality community. Just talk to people. Discord and Slack host the two biggest clusters of these online spaces, and if you're interested I can message you invites.

Another important point, particularly as a high school student, is not falling victim to groupthink. It's easy to be vulnerable to this failing in your formative years, and it can massively skew your perspective even when your thinking seems unaffected. Don't let LessWrong memetics propagate through your brain too strongly without good reason.

I expect agentic simulacra to arise even without our intentionally simulating them: agents are generally useful for solving prediction problems, and across millions of predictions (as would be expected of a product on the scale of ChatGPT or its successors), agentic simulacra are likely to occur. Even if these agents are only approximations, in predicting the behavior of approximated agents their preferences could still be satisfied in the real world (as described in the Hubinger post).

The problem I'm interested in is how you ensure that all subsequent agentic simulacra (whether they arise intentionally or otherwise) are safe, which seems difficult to verify formally due to the Löbian Obstacle.

Which part specifically are you referring to as being overly complicated? What I take the primary assertions of the post to be are:

  • Simulacra may themselves conduct simulation, and advanced simulators could produce vast webs of simulacra organized as a hierarchy.
  • Simulating an agent is not fundamentally different to creating one in the real world.
  • Due to instrumental convergence, agentic simulacra might be expected to engage in resource acquisition. This could take the shape of 'complexity theft' as described in the post.[1]
  • The Löbian Obstacle accurately describes why an agent cannot obtain a formal guarantee via design-inspection of its subsequent agent.
  • For a simulator to be safe, all simulacra need to be aligned, unless we find some upper bound of the form "programs of this complexity are too simple to be dangerous", at which point we would only need to consider simulacra above that complexity.
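For reference, the Löbian Obstacle rests on Löb's theorem. Stated for a sufficiently strong theory T (e.g. one extending Peano arithmetic), with the box read as "provable in T":

```latex
% Löb's theorem: \Box_T P abbreviates "P is provable in T".
\text{If } T \vdash \Box_T P \rightarrow P, \text{ then } T \vdash P.
% Internalized form:
T \vdash \Box_T(\Box_T P \rightarrow P) \rightarrow \Box_T P
```

An agent reasoning in T that wanted a blanket guarantee of the form "anything my successor proves in T is true" would need T to prove the premise above for every P; by the theorem, T would then prove every P, including false ones. So a consistent T cannot supply that guarantee about a successor reasoning in the same system.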

I'll try to justify my approach with respect to one or more of these claims, and if I can't, I suppose that would give me strong reason to believe the method is overly complicated.

[1] This doesn't have to be resource acquisition, just any negative action that we could reasonably expect a rational agent to pursue.

The issue I have with pivotal act models is that they presume an aligned superintelligence could bootstrap its capabilities quickly enough to perform that act before the next superintelligence is created. Soft takeoff seems to be a very popular view now, and it isn't conducive to this kind of scheme.

Also, if a large org were planning a pivotal act, I highly doubt they would do so publicly. I imagine that subtly modifying every GPU on the planet, melting them, or doing anything else pivotal on a planetary scale such that the resulting world has only one or a select few superintelligences (at least until a better solution exists) would be very unpopular with the public and with any government.

I don't think the post explicitly argues against either of these points, and I agree with what you have written. I do think these are useful things to bring up in such a discussion, however.

I have enjoyed your writing both on LessWrong and on your personal blog. I share your lack of engagement with EA and with Hanson (although I find Yudkowsky's writing very elegant and was drawn to LW as a result). If not the above, which intellectuals do you find compelling, and what makes them so compared to Hanson/Yudkowsky?

In (P2) you talk about a roadblock for RSI, but in (C) you talk about RSI as a roadblock. Is that intentional?

This was a typo. 

By "difficult", do you mean something like, many hours of human work or many dollars spent?  If so, then I don't see why the current investment level in AI is relevant.  The investment level partially determines how quickly it will arrive, but not how difficult it is to produce.

In a safety context, the primary implication of a capabilities problem's difficulty is, in most cases, when that capability will arrive. I didn't mean to imply that the investment amount determines the difficulty of the problem, but that investing additional resources into a problem makes it likely to be solved faster than it would be otherwise. As a result, the desired effect of RSI being a difficult hurdle to overcome (widening the window before AGI) wouldn't be realized.
