All of Charlie Sanders's Comments + Replies

There's a parallelism here between the mental constructs you're referring to and the physical architecture of the human body. For instance, each lobe of our brain has been associated with various tasks, goals, and activities. When you take a breath, your medulla oblongata has taken in information about levels of carbon dioxide in the blood via pH monitoring, decided that your blood has too much carbon dioxide, and sent a request to the respiratory center to breathe. But you've also got a cerebral cortex that gets a say in the decisions made by the... (read more)

I do not believe that 3a is sufficiently logically supported. The criticisms of AI risk that have seemed strongest to me concern the AI alignment community's lack of engagement with the various barriers that undercut this argument. Against them, conjecture about what protein folding and ribosomes might one day make possible is a really weak counterargument, based as it is on no empirical or evidentiary reasoning.

Specifically, I believe further nuance is needed about the can vs will distinction in the assumption that the first A... (read more)

I'm not sure I've parsed this correctly, but if I have, can I ask what unsupported conjecture you think undergirds this part of the argument? It's difficult to say what counts as "empirical" or "evidentiary" reasoning in domains where the entire threat model is "powerful stuff we haven't managed to build ourselves yet", given we can be confident that set isn't empty. (Also, keep in mind that nanotech is merely presented as a lower bound of how STEM-AGI might achieve DSA, being a domain where we already have strong reasons to believe that significant advances which we haven't yet achieved are nonetheless possible.) Why? This doesn't seem to be how it worked with humans, where it was basically a step function from technology not existing, to existing. This sure is assuming a good chunk of the opposing conclusion. And, sure, but it's not clear why any of this matters? What is the thing that we're going to (attempt to) do with AI, if not use it to solve real-world problems?

My intuition is that a simulation such as the one being proposed would take far longer to develop than the timeline outlined in this post. I’d posit that the timeline would be closer to 60 years than 6.

Also, a suggestion for tl;dr: The Truman Show for AI.

Generally agree with that intuition, but as others point out the real Truman Show for AI is simboxing (which is related, but focuses more on knowledge containment to avoid deception issues during evaluations). Davidad is going for more formal safety where you can mostly automatically verify all the knowledge in the agent's world model, presumably verify prediction rollouts, verify the selected actions correspond to futures that satisfy the bargaining solution, etc etc. The problem is this translates to a heavy set of constraints on the AGI architecture. LOVES is instead built on the assumption that alignment can not strongly dictate the architecture - as the AGI architecture is determined by the competitive requirements of efficient intelligence/inference (and so is most likely a large brain-like ANN). We then search for the hyperparams and minimal tweaks which maximize alignment on top of that (ANN) architecture.

My understanding is that the world model is more like a very coarse projection of the world than a simulation.

It's not the case that the AGI has to be fooled into thinking the simulation is real like in the Truman Show (I like the name though!).

Davidad only tries to achieve 'safety' - not alignment. Indeed the AI may be fully unaligned.

The proposal is different from simulation proposals like Jacob Cannell's LOVE in a simbox, where one tries to align the values of the AI.

In davidad's proposal the actual AGI is physically boxed and cannot interact with the world except... (read more)

I don't have the same intuition for the timeline, but I really like the tl;dr suggestion!


Davidad seems to be aiming for what I'd call infeasible rigor, presumably in hope of getting something that would make me more than 95% confident of success.

I expect we could get to 80% confidence with this basic approach, by weakening the expected precision of the world model, and evaluating the AI on a variety of simulated worlds, to demonstrate that the AI's alignment is not too sensitive to the choice of worlds. Something along the lines of the simulations in Jake Cannell's LOVE in a simbox.

Is 80% confidence the best we can achieve? I don't know.

Agreed. A common failure mode in these discussions is to treat intelligence as equivalent to technological progress, instead of as an input to technological progress. 

Yes, in five years we will likely have AIs that will be able to tell us exactly where it would be optimal to allocate our scientific research budget. Notably, that does not mean that all current systemic obstacles to efficient allocation of scarce resources will vanish. There will still be the same perverse incentive structure for funding allocated to scientific progress as there is toda... (read more)

One of the unstated assumptions here is that an AGI has the power to kill us. I think it's at least feasible that the first AGI that tries to eradicate humanity will lack the capacity to eradicate humanity - and any discussion about what an omnipotent AGI would or would not do should be debated in a universe where a non-omnipotent AGI has already tried and failed to eradicate humanity. 

That is, many of the worlds with an omnipotent AGI already had a non-omnipotent AGI that tried and failed to eradicate humanity. Therefore, when discussing worlds with an omnipotent AGI, it's relevant to bring up the possibility that there was a near-miss in those worlds in the past. (But the discussion itself can take place in a world without any near-misses, or in a world without any AGIs, with the referents of that discussion being other worlds, or possible futures of that world.)

In many highly regulated manufacturing organizations there are people working for the organization whose sole job is to evaluate each and every change order for compliance to stated rules and regulations - they tend to go by the title of Quality Engineer or something similar. Their presence as a continuous veto point for each and every change, from the smallest to the largest, aligns organizations to internal and external regulations continuously as organizations grow and change.

This organizational role needs to have an effective infrastructure supporting ... (read more)

As someone that interacts with Lesswrong primarily via an RSS feed of curated links, I want to express my appreciation for curation when it's done early enough to be able to participate early in the comment section's development lifecycle. Kudos for quick curation here.

What's your definition of quick? This one is curated 17 days after posting, I wouldn't have thought that quick.

How else were people thinking this ban was going to be able to go into effect? America has a Constitution that defines checks and balances. This legislation is how you do something like "Ban TikTok" without it being immediately shot down in court. 

It's verbatim. I think it picked up on the concept of the unreliable narrator from the H.P. Lovecraft reference and incorporated it into the story where it could make it fit - but then, maybe I'm just reading into things. It's only guessing the next word, after all!

Just to call it out, this post is taking the Great Man Theory of historical progress as a given, whereas my understanding of the theory is that it’s highly disputed/controversial in academic discourse.

I'll admit, my prior expectation on Great Man Theory is that it's probably right, if not in every detail, due to the heavy tail of impact, where a few people generate most of the impact.
  1. It would by definition not be a bad thing. "Bad thing" is a low-effort heuristic that is inappropriate here, since I interpret "bad" to mean that which is not good, and good includes aggregate human desires, which in this scenario have been defined to include a desire to be turned into paperclips. 
  2. The ideal scenario would be for humans and AIs to form a mutually beneficial relationship where the furtherance of human goals also furthers the goals of AIs. One potential way to accomplish this would be to create Neuralink-esque integrations of AI into human biolo
... (read more)
1Karl von Wendt1y
I don't see how a Neuralink-like connection would solve the problem. If a superintelligent AI needs biological material for some reason, it can simply create it, or it could find a way to circumvent the links if they make it harder to reach its goal. In order for an AGI to "require" living, healthy, happy humans, this has to be part of its own goal or value system. 

See, this is the perfect encapsulation of what I'm saying - it could design a virus, sure. But when it didn't understand parts of the economy, that's all it would be - a design. Taking something from the design stage to the "physical, working product with validated processes that operate with sufficient consistency to achieve the desired outcome" is a vast, vast undertaking, one that requires intimate involvement with the physical world. Until that point is reached, it's not a "kill all humans but fail to paperclip everyone" virus, it's just a design conce... (read more)

There are diminishing returns to thinking about moves versus performing the moves and seeing the results that the physics of the universe imposes on the moves as a consequence. 

Think of it like AlphaGo - if it only ever could train itself by playing Go against actual humans, it would never have become superintelligent at Go. Manufacturing is like that - you have to play with the actual world to understand bottlenecks and challenges, not a hypothetical artificially created simulation of the world. That imposes rate-of-scaling limits that are currently being discounted. 

This is obviously untrue in both the model-free and model-based RL senses. There are something like 30 million human Go players who can play a game in two hours. AlphaGo was trained on policy gradients from, as it happens, on the order of 30m games; so it could accumulate a similar order of games in under a day; the subset of pro games can be upweighted to provide most of the signal - and when they stop providing signal, well then, it must have reached superhuman... (For perspective, a good 0.5m or whatever professional games used to imitation-train AG came from a single Go server, which was not the most popular, and that's why AlphaGo Master ran its pro matches on a different larger server.) Do this for a few days or weeks, and you will likely have exactly that, in a good deal less time than 'never', which is a rather long time. More relevantly, because you're not making a claim about the AG architecture specifically but about all learning agents in general: with no exploration, MuZero can bootstrap its model-based self-play from somewhere in the neighborhood of hundreds/thousands of 'real' games (as should not be a surprise as Go rules are simple), and achieves superhuman gameplay easily by self-play inside the learned model, with little need for any good human opponents at all; even if that is 3 orders of magnitude off, it's still within a day of human gameplay sample-size. Or consider meta-learning sim2real like Dactyl, which is trained exclusively in silico on unrealistic simulations, and adapts within seconds to reality. So either way. The sample-inefficiency of DL robotics, DL, or R&D, is more of a fact about our compute-poverty than it is about the inherent necessity of interacting with the real world (which is highly parallelizable, learnable offline, and far smaller than existing methods).
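To see why "under a day" follows, here's a back-of-envelope sketch of the arithmetic (the player count, game length, and training-set size are the comment's own assumed figures, not independently verified):

```python
# Back-of-envelope check of the parallel-game accumulation claim.
# Assumed figures, taken from the comment above:
players = 30_000_000        # active human Go players
hours_per_game = 2          # one game takes roughly two hours
target_games = 30_000_000   # order of games behind AlphaGo's policy gradients

games_in_parallel = players / 2               # two players per game
games_per_hour = games_in_parallel / hours_per_game

hours_needed = target_games / games_per_hour
print(f"{hours_needed:.0f} hours")  # well under a day
```

Even halving the player count or doubling the game length keeps the total comfortably inside a single day, which is the only point the estimate needs to carry.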
Lack of clarity when I think about these limits makes it hard for me to see how the end result will change if we could somehow "stop discounting" them. It seems to me that we will have to be much more elaborate in describing the parameters of this thought experiment. In particular we will have to agree on deeds and real-world achievements that the hypothetical AI has, so we will both agree to call it AGI (like writing an interesting story and making illustrations so this particular research team now has a new revenue stream from selling it online - will this make the AI an AGI?). And security conditions (air-gapped server room?). This will get us closer to understanding "the rationale". But then your question is not about AGI but "superintelligent AI", so we will have to do the elaborate describing again with new parameters. And that is what I expect Eliezer (alone and with other people) has done a lot. And look what it did to him (this is a joke, but at the same time - not). So I will not be an active participant further. It is not even about a single SAI in some box: competing teams, people running copies (legal and not) and changing code, corporate espionage, dirty code...

I'm a firm believer in avoiding the popular narrative, and so here's my advice - you are becoming a conspiracy theorist. You just linked to a literal conspiracy theory with regards to face masks, one that has been torn apart as misleading and riddled with factual errors. As just one example, Cochrane's review specifically did not evaluate "facemasks", it evaluated "policies related to the request to wear face masks". Compliance to the stated rule was not evaluated, and it is therefore a conspiracy theory to go from an information source that says "this pol... (read more)

How are these the same thing?

* believing data which turns out to be wrong
* asserting that people are conspiring to cause some bad outcome

Some studies say masks work, some don't. If you incorrectly evaluate the evidence and believe that they don't work... how is that related to accusing people of conspiring? You've just analyzed the evidence wrong, but you haven't made any claims relating to any people, or plans, or schemes.

If all of EY's scenarios require deception, then detection of deception from rogue AI systems seems like a great place to focus on. Is there anyone working on that problem?

2Gerald Monroe1y
Eric Drexler is.

Listening to Eliezer walk through a hypothetical fast takeoff scenario left me with the following question:  Why is it assumed that humans will almost surely fail the first time at their scheme for aligning a superintelligent AI, but the superintelligent AI will almost surely succeed the first time at its scheme for achieving its nonaligned outcome? 

Speaking from experience, it's hard to manufacture things in the real world. Doubly so for anything significantly advanced. What is the rationale for assuming that a nonaligned superintelligence won't... (read more)

A superintelligent AI could fail to achieve its goals and still get us all killed. For example, it could design a virus that kills all humans, and then find out that there were some parts of economy it did not understand, so it runs out of resources and dies, without converting the galaxy to paperclips. Nonetheless, humans remain dead.
It is the very same rationale that stands behind assumptions like "why Stockfish won't execute a losing set of moves" - it is just that good at chess. Or better - it is just that smart when it comes down to chess. In this thought experiment the way to go is not "I see that the AGI could likely fail at this step, therefore it will fail" but to keep thinking and inventing better moves for the AGI to execute, which won't be countered as easily. It is an important part of the "security mindset" and probably a major reason why Eliezer speaks about the lack of pessimism in the field.
It's hard to manufacture things, but it's not that hard to do so in a way that pretty much can't kill you. Just keep the computation that is you at somewhat of a distance from the physical experiment, and don't make stuff that might consume the whole earth. Making a general intelligence is an extraordinary special case: if you're actually doing it, it might self-improve and then kill you.

Agreed. Facilitation-focused jobs (like the ones derided in this post) might look like bullshit to an outsider, but in my experience they are absolutely critical to effectively achieving goals in a large organization.

For 99% of people, the only viable option to achieve this is refinancing your mortgage to take any equity out and resetting terms to a 30 year loan duration.

Do you happen to know the specifics of how these cadavers came to be available? There's recently been some investigative reporting on this topic. The broad gist is that most people who are "donating their bodies to science" probably don't get that companies will take those donated bodies and sell them for what are essentially tourist attractions like the one that you're participating in. 

3Professora Em 4mo
In my teaching career I have not come across "tourist attractions", but I've seen some things offered on the Internet for $3K or more that seem questionable to me. However, the lab where Alok worked was not a tourist attraction. He was among people preparing cadavers as prosections for college anatomy and physiology courses, not for a tourist attraction. The work requires focus and diligence, and the end product becomes part of the educational program for UC Berkeley Extension and Merritt College.
3Alok Singh1y
UCSF willed body program, on contract to Merritt College. 

I'm finding myself developing a shorthand heuristic to figure out how LDT would be applied in a given situation: assume time travel is a thing. 

If time travel is a thing, then you'd obviously want to one-box Newcomb's paradox because the predictor knows the future. 

If time travel is a thing, then you'd obviously want to cooperate in a prisoner's dilemma game given that your opponent knows the future. 

If time travel is a thing, then any strategies that involve negotiating with a superintelligence that are not robust to a future version of the superintelligence having access to time travel will not work. 
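The one-boxing case above can be made concrete with a quick expected-value sketch. The $1,000 / $1,000,000 payoffs are the standard textbook values for Newcomb's problem, not figures from these comments; the "time travel" heuristic corresponds to taking the predictor's accuracy to 1.0:

```python
def newcomb_evs(accuracy):
    """Expected payoffs for one-boxing vs two-boxing, given predictor accuracy."""
    # One-boxer: receives $1M iff the predictor correctly foresaw one-boxing.
    ev_one_box = accuracy * 1_000_000
    # Two-boxer: always takes the $1k, plus the $1M only when mispredicted.
    ev_two_box = 1_000 + (1 - accuracy) * 1_000_000
    return ev_one_box, ev_two_box

for acc in (0.9, 0.99, 1.0):
    one, two = newcomb_evs(acc)
    print(f"accuracy={acc}: one-box ${one:,.0f} vs two-box ${two:,.0f}")
```

At perfect accuracy - the time-travel limit - one-boxing yields $1,000,000 against $1,000, which is why the heuristic makes the answer feel obvious.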

As someone that frequently has work reviewed by crossfunctional groups prior to implementation, I only object to change requests that I feel will make the product significantly worse. There's simply too much value lost in debating nitpicks.

3Adam Zerner1y
I think that is a fine but suboptimal approach. Suppose that the team has a disagreement in a change request but agrees that it is worth an appetite of 10 minutes of discussion. In that case, it is worth having a quick discussion. The problem is when the discussion goes (way) beyond that, but setting an appetite should hopefully prevent that problem from happening. In practice it could be difficult to stick to such an appetite even if it's agreed that that is the appetite, so I think the risk of not sticking to it is something that should be factored in. I also think that the value of such objections is usually mostly about learning for the future. I.e. you may object to some implementation that has a minor immediate effect on the product, but the implementer learns from this and it prevents them from making the mistake over and over again in the future. I think this can be worthwhile even for nitpicks, i.e. if a two-minute exchange leads to a minor improvement in the implementer's code style over the course of the next two years.

Robin’s sense of honor would probably prevent him from litigating this, but that absolutely would not hold up in court.

I had also asked him personally; and I'll double-check before making any printed copies.

Have you verified with Robin that he is okay with this from a copyright standpoint?

9Taymon Beal2y
He tweeted his approval.
I don't consider it a core Star Wars movie; if I did, it would be just below Revenge of the Sith.