All of NickGabs's Comments + Replies

I think you’re probably right. But even this will make it harder to establish an agency where the bureaucrats/technocrats have a lot of autonomy, and it seems there’s at least a small chance of an extreme ruling which could make it extremely difficult.

It might also make it easier. You can use the fact that Chevron was overruled to justify writing broad powers into the new AI safety regulation. 
Harder, yes; extremely, I'm much less convinced. In any case, Chevron was already dealt a blow in 2022, so those lobbying Congress to create an AI agency of some sort should be encouraged to explicitly give it a broad mandate (e.g. that it has the authority to settle various major economic or political questions concerning AI).

Yeah, I think they will probably produce better and more regulations than if politicians were more directly involved, but I’m not super sanguine about bureaucrats in absolute terms.

Why do you think this?

This was addressed in the post: "To fully flesh out this proposal, you would need concrete operationalizations of the conditions for triggering the pause (in particular the meaning of "agentic") as well as the details of what would happen if it were triggered. The question of how to determine if an AI is an agent has already been discussed at length at LessWrong. Mostly, I don't think these discussions have been very helpful; I think agency is probably a "you know it when you see it" kind of phenomenon. Additionally, even if we do need a more formal operat... (read more)

Thanks for posting this, I think these are valuable lessons and I agree it would be valuable for someone to do a project looking into successful emergency response practices. One thing this framing does also highlight is that, as Quintin Pope discussed in his recent post on alignment optimism, the “security mindset” is not appropriate for the default alignment problem. We are only being optimized against once we have failed to align the AI; until then, we are mostly held to the lower bar of reliability, not security. There is also the problem of malicious ... (read more)

I think concrete ideas like this that take inspiration from past regulatory successes are quite good, esp. now that policymakers are discussing the issue.

I agree with aspects of this critique. However, to steelman Leopold, I think he is not just arguing that demand-driven incentives will drive companies to solve alignment due to consumers wanting safe systems, but rather that, over and above ordinary market forces, constraints imposed by governments, media/public advocacy, and perhaps industry-side standards will make it such that it is ~impossible to release a very powerful, unaligned model. I think this points to a substantial underlying disagreement in your models - Leopold thinks that governments and th... (read more)

How do any capabilities or motivations arise without being explicitly coded into the algorithm?

I don't think it is correct to conceptualize MLE as a "goal" that may or may not be "myopic."  LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don't intervene in their environment to make it more predictable.  When I worry about LLMs being non-myopic agents, I worry about what happens when they have been subjected to lots of fine tuning, perhaps via Ajeya Cotra's idea of "HFDT," for a while after pre-training.  Thus, while pretraining from human preferences might shift the initial distrib... (read more)

Hm, that might be a potential point of confusion. I agree that there's no agentic stuff, at least without RL or a memory source, but the LLM is still pursuing the goal of maximizing the likelihood of the training data, which comes apart pretty quickly from the preferences of humans, for many reasons. You're right that it doesn't actively intervene, mostly because of the following:

1. There's no RL, usually.
2. It is memoryless, in the sense that it forgets itself.
3. It doesn't have a way to store arbitrarily long/complex problems in memory, nor can it write memories to a brain.

But the Maximum Likelihood Estimation goal still gives you misaligned behavior, and I'll give you examples: completing buggy Python code in a buggy way, or espousing views consistent with those expressed in the prompt (sycophancy). So the LLM is still optimizing for Maximum Likelihood Estimation; it just has certain limitations, so the misalignment shows up passively instead of actively.
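For concreteness, the "Maximum Likelihood Estimation goal" being discussed here is just the per-token cross-entropy loss used in pretraining; a minimal sketch with made-up numbers (the vocabulary and probabilities are hypothetical):

```python
import math

def next_token_nll(probs, target_index):
    """Negative log-likelihood of the observed next token.

    Pretraining minimizes the average of this quantity over the corpus,
    which is exactly maximum-likelihood estimation. Note the loss at each
    position depends only on the model's distribution for that single
    next token, not on any downstream consequences: the objective itself
    is myopic in form.
    """
    return -math.log(probs[target_index])

# Toy distribution over a 4-token vocabulary (hypothetical numbers).
probs = [0.1, 0.6, 0.2, 0.1]
loss = next_token_nll(probs, target_index=1)  # -ln(0.6), about 0.51
```

Whether an objective of this per-token form counts as a "goal" the model actively pursues is exactly the point under dispute in this exchange.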
  1. How does it give the AI a myopic goal?  It seems like it's basically just a clever form of prompt engineering in the sense that it alters the conditional distribution that the base model is predicting, albeit in a more robustly good way than most/all prompts, but base models aren't myopic agents, they aren't agents at all.  As such, I'm not concerned about pure simulators/predictors posing xrisks, but about what happens when people do RL on them to turn them into agents (or similar techniques like decision transformers).  I think it's plausible that
... (read more)
It's basically replacing Maximum Likelihood Estimation, the goal that LLMs and simulators currently use, with the goal of minimizing cross-entropy from a feedback-annotated webtext distribution, and in particular it's a simple, myopic goal, which prevents deceptive alignment. Even if we turn the model into an agent, it will be a pretty myopic one, or an aligned, non-myopic agent at worst. Specifically, the fact that it can both improve at PEP8 compliance (essentially generating well-formatted Python code) and get better at not emitting personal identifying information is huge. Especially the second task, as it indirectly speaks to a very important question: can we control power-seeking such that an AI doesn't seek power when doing so would be misaligned with a human's interests? If the model doesn't try to get personal identifying information, then it's voluntarily limiting its ability to seek power when it detects that it's misaligned with a human's values. That's arguably one of the core functions of any workable alignment strategy: controlling power-seeking.

Pre training on human feedback? I think it’s promising but we have no direct evidence of how it interacts with RL finetuning to make LLMs into agents which is the key question.

Yes, I'm talking about that technique, known as Pretraining from Human Feedback. The biggest reasons I'm so optimistic about the technique, even with its limitations, are the following:

1. It almost completely or completely solves deceptive alignment by giving the model a myopic goal, so there's far less incentive, or no incentive, to be deceptive.
2. It scales well with data, which is extremely useful: the more data it has, the more aligned it will be.
3. The tests, while sort of unimportant from our perspective, gave tentative evidence for the proposition that we can control power-seeking, such that an AI avoids seeking power when it's misaligned and seeks power only when it's aligned.
4. They dissolved, rather than resolved, embedded agency/embedded alignment concerns by using offline learning. In particular, the AI can't hack or manipulate a human's values, unlike with online learning. In essence, they translated the ontology of Cartesianism and its boundaries in a sensible way to an embedded world.

It's not a total one-shot solution, but it's the closest we've come to one, and I can see a fairly straightforward path to alignment from here.

Yes but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that eg RLHF fine tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attain... (read more)

Yeah, I agree with both your object-level claim (i.e. I lean towards the “alignment is easy” camp) and, to a certain extent, your psychological assessment, but this is a bad argument. Optimism bias is also well documented in many cases, so to establish that people who think alignment is hard are overly pessimistic, you need to argue more on the object level against the claim, or provide highly compelling evidence that such people are systematically, irrationally pessimistic on most topics.

You're right that optimism bias is an issue, but optimism bias is generally an individual phenomenon, and what matters most is what people share rather than what they believe, so negative news being shared more is the bigger issue. But recently we found an alignment technique that solves almost every alignment problem in one go and scales well with data.

Strong upvote. A corollary here is that a really important part of being a “good person” is being good at being able to tell when you’re rationalizing your behavior/otherwise deceiving yourself into thinking you’re doing good. The default is that people are quite bad at this but as you said don’t have explicitly bad intentions, which leads to a lot of people who are at some level morally decent acting in very morally bad ways.

In the intervening period, I've updated towards your position, though I still think it is risky to build systems with capabilities that open-ended which are that close to agents in design space.

I agree with something like this, though I think you're too optimistic w/r/t deceptive alignment being highly unlikely if the model understands the base objective before getting a goal.  If the model is sufficiently good at deception, there will be few to no differential adversarial examples.  Thus, while gradient descent might have a slight preference for pointing the model at the base objective over a misaligned objective induced by the small number of adversarial examples, the vastly larger number of misaligned goals suggests to me that it is ... (read more)

We're talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned. Also, because the model doesn't yet have a goal, the number of differential adversarial examples differs for each potential proxy goal. I agree that there’s a vastly larger number of possible misaligned goals, but because we are talking about a model that is not yet deceptive, the vast majority of those misaligned goals would have a huge number of differential adversarial examples. If training involved a general goal, then I wouldn’t expect many, if any, proxies to have a small number of differential adversarial examples in the absence of deceptive alignment. Would you?

My understanding of Shard Theory is that what you said is true, except sometimes the shards "directly" make bids for outputs (particularly when they are more "reflexive," e. g. the "lick lollipop" shard is activated when you see a lollipop), but sometimes make bids for control of a local optimization module which then implements the output which scores best according to the various competing shards.  You could also imagine shards which do a combination of both behaviors.  TurnTrout can correct me if I'm wrong.

My takeoff speeds are on the somewhat faster end, probably ~a year or two from “we basically don’t have crazy systems” to “AI (or whoever controls AI) controls the world”

EDIT: After further reflection, I no longer endorse this. I would now put 90% CI from 6 months to 15 years with median around 3.5 years. I still think fast takeoff is plausible but now think pretty slow is also plausible and overall more likely.

Got it. To avoid derailing with this object level question, I’ll just say that I think it seems helpful to be explicit about takeoff speeds in macrostrategy discussions. Ideally, specifying how different strategies work over distributions of takeoff speeds.

I think I'm like >95th percentile on verbal-ness of thoughts.  I feel like almost all of my thoughts that aren't about extremely concrete things in front of me or certain abstract systems that are best thought of visually are verbal, and even in those cases I sometimes think verbally.  Almost all of the time, at least some words are going through my head, even if it's just random noise or song lyrics or something like that.  I struggle to imagine what it would be like to not think this way, as if I feel like many propositions can't be eas... (read more)

I was the same way, but I honestly do not feel a negative impact from skimming the useless noise off. You should try it! Just catch yourself when you're making short-term, ultimately unproductive observations. It helps to switch to thinking in a language you're less familiar with. Then if you wish, you can return to the super-verbal state of mind.

Hmm... I guess I'm skeptical that we can train very specialized "planning" systems?  Making superhuman plans of the sort that could counter those of an agentic superintelligence seems like it requires both a very accurate and domain-general model of the world as well as a search algorithm to figure out which plans actually accomplish a given goal given your model of the world.  This seems extremely close in design space to a more general agent.  While I think we could have narrow systems which outperform the misaligned superintelligence in o... (read more)

Well, simulator type systems like GPT-3 do not become agents if amplified to superhuman cognition. Simulators could be used to generate/evaluate superhuman plans without being agents with independent objectives of their own.

This makes sense, but it seems to be a fundamental difficulty of the alignment problem itself as opposed to the ability of any particular system to solve it.  If the language model is superintelligent and knows everything we know, I would expect it to be able to evaluate its own alignment research as well as if not better than us.  The problem is that it can't get any feedback about whether its ideas actually work from empirical reality given the issues with testing alignment problems, not that it can't get feedback from another intelligent grader/assessor reasoning in a ~a priori way.

I think this is a very good critique of OpenAI's plan.  However, to steelman the plan, I think you could argue that advanced language models will be sufficiently "generally intelligent" that they won't need very specialized feedback  in order to produce high quality alignment research.  As e. g. Nate Soares has pointed out repeatedly, the case of humans suggests that in some cases, a system's capabilities can generalize way past the kinds of problems that it was explicitly trained to do.  If we assume that sufficiently powerful language... (read more)

2 · Alex Flint · 1y
Well even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don't actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of. Now suppose that you have this intelligent person write essays about how one ought to play game X, and have their essays assessed by other humans who have some familiarity with game X but not a clear understanding. It is not impossible that this could work, but it does seem unlikely. There are a lot of levels of indirection stacked against this working. So overall I'm not saying that language models can't be generally intelligent, I'm saying that a generally intelligent entity still needs to be in a tight feedback loop with the problem itself (whatever that is).

I agree with most of these claims.  However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/doomy than it seems like you are.  I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.  

I think human history provides mixed evidence for the ability of such agents to take over the world.  While almost every human in history has failed to acc... (read more)
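For scale on the "+12 SD" figure: under a normal distribution, a +12 SD draw is astronomically rare; a quick sanity check (assuming normality that far into the tail, which real trait distributions almost certainly violate):

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Rarity of a +12 SD draw: roughly 1.8e-33, far rarer than
# one occurrence among all humans who have ever lived.
p = normal_tail(12.0)
```

So "+12 SD across all relevant domains" is best read as a stand-in for "vastly superhuman", not a level any actual human approaches, which is presumably why the thread treats it as a hypothetical.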

I specifically said a human with +12 SD g factor. I didn't actually consider what a superintelligence at that level across all domains would mean, but I don't think it would matter, because of objection 4: by the time superhuman agents arrive, we would already have numerous superhuman non-agentic AIs, including systems specialised for planning/tactics/strategy. You'd need to make particular claims about how a superhuman agent performs in a world of humans amplified by superhuman non-agents. It's very not obvious to me that it can win any ensuing cognitive arms race. I am sceptical that a superhuman agent/agency would easily attain decisive cognitive superiority over the rest of civilisation.

You write that even if the mechanistic model is wrong, if it “has some plausible relationship to reality, the predictions that it makes can still be quite accurate.” I think that this is often true, and true in particular in the case at hand (explicit search vs not). However, I think there are many domains where this is false, where there is a large range of mechanistic models which are plausible but make very false predictions. This depends roughly on how much the details of the prediction vary depending on the details of the mechanistic model. In the... (read more)

Yes, I agree—this is why I say:

I think the NAH does a lot of work for interpretability of an AI's beliefs about things that aren't values, but I'm pretty skeptical about the "human values" natural abstraction.  I think the points made in this post are good, and relatedly, I don't want the AI to be aligned to "human values"; I want it to be aligned to my values.  I think there’s a pretty big gap between my values and those of the average human even subjected to something like CEV, and that this is probably true for other LW/EA types as well.  Human values as they exist in nature contain fundamental terms for the in group, disgust based values, etc.

If your values don't happen to have the property of giving the world back to everyone else, building an AGI with your values specifically (when there are no other AGIs yet) is taking over the world. Hence human values, something that would share influence by design, a universalizable objective for everyone to agree to work towards. On the other hand, succeeding in directly (without AGI assistance) building aligned AGIs with fixed preference seems much less plausible (in time to prevent AI risk) than building task AIs that create uploads of specific people (a particularly useful application of strawberry alignment), to bootstrap alignment research that's actually up to the task of aligning preferences (ambitious alignment). And those uploads are agents of their own values, not human values, a governance problem.

Human bureaucracies are mostly misaligned because the actual bureaucratic actors are also misaligned. I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit but I don’t think it’s extremely implausible that it is outer aligned enough to work, though I am much more enthusiastic about IDA

Good point, post updated accordingly.

I think making progress on ML is pretty hard.  In order for a single AI to self improve quickly enough that it changed timelines, it would have to improve close to as fast as the speed at which all of the humans working on it could improve it.  I don't know why you would expect to see such superhuman coding/science capabilities without other kinds of superintelligence.

I think the world modelling improvements from modern science and IQ raising social advances can be analytically separated from changes in our approach to welfare. As for non consensual wireheading, I am uncertain as to the moral status of this, so it seems like partially we just disagree about values. I am also uncertain as to the attitude of Stone Age people towards this - while your argument seems plausible, the fact that early philosophers like the Ancient Greeks were not pure hedonists in the wireheading sense but valued flourishing seems like evidence against this, suggesting that favoring non consensual wireheading is downstream of modern developments in utilitarianism.

1 · Thane Ruthenis · 1y
Fair enough, I suppose. My point is more: okay, let's put the Stone Age people aside for a moment and think about the medieval people instead. Many of them were religious and superstitious and nationalistic, as the result of being raised on a diet of various unwholesome ideologies. These ideologies often had their own ideas of "the greater good" that they tried to sell people, ideas no less un-nice than non-consensual wireheading. Thus, a large fraction of humanity for the majority of its history endorsed views that would be catastrophic if scaled up.

I just assume this naturally extrapolates backwards to the Stone Age. Stone-age people had their own superstitions and spiritual traditions, and rudimentary proto-ideologies. I assume that these would also be catastrophic if scaled up.

Note that I'm not saying that the people in the past were literally alien, to the extent that they wouldn't be able to converge towards modern moral views if we e. g. resurrected and educated one of them (and slightly intelligence-amplified them to account for worse nutrition, though I'm not as convinced that it'd be necessary as you), then let them do value reflection. But this process of "education" would need to be set up very carefully, in a way that might need to be "perfect". My argument is simply that if we granted godhood to one of these people and let them manage this process themselves, that will doom the light cone.

The claim about Stone Age people seems probably false to me - I think if Stone Age people could understand what they were actually doing (not at the level of psychology or morality, but at the purely "physical" level), they would probably do lots of very nice things for their friends and family, in particular give them a lot of resources.  However, even if it is true, I don't think the reason we have gotten better is because of philosophy - I think it's because we're smarter in a more general way.  Stone Age people were uneducated and had less good nutrition than us; they were literally just stupid. 

1 · Thane Ruthenis · 1y
Education is part of what I'm talking about. Modern humans iterate on the output of thousands of years of cultural evolution; their basic framework of how the world works is drastically different from the ancestral ones. Implicit and explicit lessons of how to care about people without e. g. violating their agency come part and parcel with it.

At the basic level, why do you think that their idea of "nice things" would be nuanced enough to understand that, say, non-consensual wireheading is not a nice thing? Some modern people don't. Stone Age people didn't live a very comfortable life by modern standards; the experience of pleasure and escape from physical ailments would be common aspirations, while the high-cognitive-tech ideas of "self-actualization" would make no native sense to them. Why would a newly-ascended Stone Age god not assume that making everyone experience boundless pleasure free of pain forever is the greatest thing there could possibly be? Would it occur to that god to even think carefully about whether such assumptions are right?

Edit: More than that, ancient people's world-models are potentially much more alien and primitive than we can easily imagine. I refer you to the speculations in section 2 here. The whole "voices of the gods" thing in the rest of the post is probably wrong, but I find it easy to believe that the basic principles of theory-of-mind that we take for granted are not something any human would independently invent. And if so, who knows what people without it would consider the maximally best way to be nice to someone?

Having had a similar experience, I strongly endorse this advice.  Actually optimizing for high quality relationships in modern society looks way different than following the social strategies that didn't get you killed in the EEA.

I think this is probably true; I would assign something like a 20% chance of some kind of government action in response to AI aimed at reducing x-risk, and maybe a 5-10% chance that it is effective enough to meaningfully reduce risk.  That being said, 5-10% is a lot, particularly if you are extremely doomy.  As such, I think it is still a major part of the strategic landscape even if it is unlikely.

Why should we expect that as the AI gradually automates us away, it replace us with better versions of ourselves rather than non-sentient, or minimally non-aligned, robots who just do its bidding?

I don’t think we have time before AGI comes to deeply change global culture.

This is probably true for some extremely high level of superintelligence, but I expect much stupider systems to kill us if any do; I think human-level-ish AGI is already a serious x-risk, and humans aren’t even close to being intelligent enough to do this.

1 · Thane Ruthenis · 2y
See my other reply.

Why do you expect that the most straightforward plan for an AGI to accumulate resources is so illegible to humans? If the plan is designed to be hidden to humans, then it involves modeling them and trying to deceive them. But if not, then it seems extremely unlikely to look like this, as opposed to the much simpler plan of building a server farm. To put it another way, if you planned using a world model as if humans didn’t exist, you wouldn’t make plans involving causing a civil war in Brazil. Unless you expect the AI to be modeling the world at an atomic level, which seems computationally intractable particularly for a machine with the computational resources of the first AGI.

This. Any realistic takeoff with classical computers cannot rely on simulating the world atomically to take over the world, because the Landauer limit is so binding. Either the AI has very good models of humans and deceptive capabilities (which I think are likely), or it doesn't win. Otherwise you are postulating perpetual-motion machines in AI form, or assuming quantum computers are likely to be practical this century.
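The Landauer point can be made with a back-of-envelope calculation; a sketch with round, approximate numbers (the solar luminosity and Earth's atom count are rough figures, and this sets aside reversible computing, which in principle evades the erasure bound):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def landauer_limit_joules(temperature_kelvin):
    """Minimum energy to erase one bit of information (Landauer's principle)."""
    return K_B * temperature_kelvin * math.log(2)

e_bit = landauer_limit_joules(300.0)    # ~2.9e-21 J per bit at room temperature
solar_output_watts = 3.8e26             # the Sun's entire luminosity (approximate)

# Generous ceiling: ~1.3e47 bit erasures per second even when capturing
# the whole Sun's output. Earth alone has roughly 1e50 atoms, and an
# atomic simulation needs many bit operations per atom per timestep, so
# even a Dyson-sphere-scale computer at the Landauer limit falls short.
max_bit_erasures_per_s = solar_output_watts / e_bit
```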

This seems unlikely to be the case to me.  However, even if this is the case and so the AI doesn't need to deceive us, isn't disempowering humans via force still necessary?  Like, if the AI sets up a server farm somewhere and starts to deploy nanotech factories, we could, if not yet disempowered, literally nuke it.  Perhaps this exact strategy would fail for various reasons, but more broadly, if the AI is optimizing for gaining resources/accomplishing its goals as if humans did not exist, then it seems unlikely to be able to defend against h... (read more)

If I'm, say, building a dam, I do not particularly need to think about the bears which formerly lived in the flooded forest. It's not like the bears are clever enough to think "ah, it's the dam that's the problem, let's go knock it down". The bears are forced out and can't do a damn thing about it, because they do not understand why the forest is flooded. I wouldn't be shocked if humans can tell their metaphorical forest is flooding before the end. But I don't think they'll understand what's causing it, or have any idea where to point the nukes, or even have any idea that nukes could solve the problem. I mean, obviously there will be people yelling "It's the AI! We must shut it down!", but there will also be people shouting a hundred other things, as well as people shouting that only the AI can save us. This story was based on a somewhat different prompt (it assumed the AI is trying to kill us and that the AI doesn't foom to nanotech overnight), but I think the core mood is about right:
Basically this. A human who fights powerful animals like gorillas, bears, lions, or tigers will mostly lose melee fights barring luck or outliers, and even a lucky winner has probably been injured seriously enough to die without treatment. If the human thinks about it and brings a gun, the situation is reversed, with animals struggling to defeat humans barring outliers or luck. That's the power of thinking: not to enhance your previous skills, but to gain all-new skills.

Check out CLR's research: they are focused on answering questions like these, because they believe that competition between AIs is a big source of s-risk.

Thanks, I'll be sure to check them out.

It seems to me that it is quite possible that language models develop into really good world modelers before they become consequentialist agents or contain consequentialist subagents. While I would be very concerned with using an agentic AI to control another agentic AI for the reasons you listed and so am pessimistic about eg debate, AI still seems like it could be very useful for solving alignment.

Language models develop really good world models… primarily of humans writing text on the internet. Who are consequentialist agents, and are not fully aligned (in the absence of effective law enforcement) with other humans.

This seems pretty plausible to me, but I suspect that the first AGIs will exhibit a different distribution of skills across cognitive domains than humans and may also be much less agentic. Humans evolved in environments where the ability to form and execute long-term plans to accumulate power and achieve dominance over other humans was highly selected for. The environments in which the first AGIs are trained may not have this property. That doesn’t mean they won’t develop it, but they may very well not until they are more strongly and generally superintelligent.

They rely on secrecy to gain relative advantages, but absolutely speaking, openness increases research speed; it increases the amount of technical information available to every actor.

With regard to God specifically, belief in God is somewhat unique because God is supposed to make certain things good in virtue of his existence; the value of the things religious people value is predicated on the existence of God.  In contrast, the value of cake to the kid is not predicated on the actual existence of the cake.  

I think this is a good point and one reason to favor more CEV-style solutions to alignment, if they are possible, rather than solutions which merely make the values of the AI relatively "closer" to our original values.

Eh, CEV got rightly ditched as an actual solution to the alignment problem. The basic problem is that it assumed there was an objective moral reality, and we have little evidence of that. It's very possible morals are subjective, which outright makes CEV non-viable. May that alignment solution never be revived.

Or, the other way around, perhaps "values" are defined by being robust to ontology shifts.

This seems wrong to me.  I don't think that reductive physicalism is true (i. e. the hard problem really is hard), but if I did, I would probably change my values significantly.  Similarly for religious values; religious people seem to think that God has a unique metaphysical status such that his will determines what is right and wrong, and if no being with such a metaphysical status existed, their values would have to change.

5 · Thane Ruthenis · 2y
Suppose that there's a kid who is really looking forward to eating the cake his mother promised to bake for him this evening. You might say he values this cake that he's sure exists as he's coming back home after school. Except, he shortly learns that there's no cake: his mother was too busy to make it. Do his values change?

Same with God. Religious people value God, okay. But if they found out there's no God, that doesn't mean they'd have to change their values; only their beliefs. They'd still be the kinds of people who'd value an entity like God if that entity existed. If God doesn't exist, and divine morality doesn't either, that'd just mean the world is less aligned with their values than they'd thought — like the kid who has less sweets than he'd hoped for. They'd re-define their policies to protect or multiply whatever objects of value actually do exist in the world, or attempt to create the valuable things that turned out not to exist (e. g., the kid baking a cake on his own). None of that involves changes to values.

How do you know what that is? You don't have the ability to stand outside the mind-world relationship and perceive it, any more than anything else. You have beliefs about the mind-world relationship, but they are all generated by inference in your mind. If there were some hard core of non-inferential knowledge about the ontological nature of reality, you might be able to lever it to gain more knowledge, but there isn't, because the same objections apply.

I'm not making any claims about knowing what it is.  The OP's argument is that our normal dete... (read more)

For what it's worth, I think there needs to be some clarification.  I didn't say our model is deterministic, nor did I take a position on whether it should be. And my argument is not about whether the correct definition of knowledge is "justified true belief". Unless I have the wrong impression, I don't think Sean Carroll's focus is on the definition of knowledge either. Instead, it's about what should be considered "true". The usual idea that a theory is true if it faithfully describes an underlying objective physical reality (deterministic or not) is problematic. It suffers the same pitfall as believing I am a Boltzmann brain. The dilemma is that theories are produced and evaluated by worldly objects, while their truth ought to be judged from "a view from nowhere", a fundamentally objective perspective.  Start reasoning by recognizing that I am a particular agent, and you will not have this problem. I don't deny that; in fact, I think that is the solution to many paradoxes. But the majority of people would start reasoning from the "view from nowhere" and regard that as the only way. I think that is what has led people astray in many problems, like decision paradoxes such as Newcomb, anthropics, and to a degree, quantum interpretations.
What was the first thing you said I disagreed with? I disagree with all of that. I disagree that the world is known to be deterministic. I disagree that you can found epistemology on ontology. You don't know that the mind-world relationship works in a certain way absent having an epistemology that says so. I disagree that we have all the knowledge we want or need. I disagree that correlation is sufficient to solve the problem.

By "process," I don't mean internal process of thought involving an inference from perceptions to beliefs about the world, I mean the actual perceptual and cognitive algorithm as a physical structure in the world.  Because of the way the brain actually works in a deterministic universe, it ends up correlated with the external world.  Perhaps this is unknowable to us "from the inside," but the OP's argument is not about external world skepticism given direct access only to what we perceive, but rather that given normal hypotheses about how the bra... (read more)

How do you know what that is? You don't have the ability to stand outside the mind-world relationship and perceive it, any more than anything else. You have beliefs about the mind-world relationship, but they are all generated by inference in your mind. If there were some hard core of non-inferential knowledge about the ontological nature of reality, you might be able to lever it to gain more knowledge, but there isn't, because the same objections apply. We don't know that the universe is deterministic. You are confusing assumptions with knowledge. The point is about correspondence. Neither correlations nor predictive accuracy amount to correspondence to a definite ontology. We don't want correlation, we want correspondence. Correlation isn't causation, and it isn't correspondence. Assuming the scientific model doesn't help, because the scientific model says that the way perceptions relate to the world is indirect, going through many intermediate causal stages. Since multiple things could possibly give rise to the same perceptions, a unique cause (i.e., a definite ontology) can't be inferred from perception alone.

We get correspondence to reality through predictive accuracy; we can predict experience well using science because scientific theories are roughly isomorphic to the structures in reality that they are trying to describe.

We have no way of knowing that, because such isomorphism cannot be checked directly. Also, small increments in predictivity can be associated with major shifts in ontology. We do not know that we are on the final theory, and the next theory could have a different ontology, but only one extra significant digit of accuracy.

Yeah, this is exactly right imo.  Thinking about good epistemics as believing what is "justified" or what you have "reasons to believe" is unimportant/useless insofar as it departs from "generated by a process that makes the ensuing map correlate with the territory."  In the world where we don't have free will, but our beliefs are produced deterministically by our observations and our internal architecture in a way such that they are correlated with the world, we have all the knowledge that we need.

We don't have processes for ensuring correspondence to reality... what we have are processes for ensuring predictive accuracy, which is not the same thing.

While this doesn't answer the question exactly, I think important parts of the answer include the fact that an AGI could upload itself to other computers, as well as acquire resources (minimally, money) entirely through the internet (e.g., by investing in stocks).  A superintelligent system with access to trillions of dollars and huge numbers of copies of itself on computers throughout the world more obviously has a lot of potentially very destructive actions available to it than one stuck on a single computer with no resources.

The common man's answer here would presumably be along the lines of "so we'll just make it illegal for an A.I. to control vast sums of money long before it gets to owning a trillion — maybe an A.I. can successfully pass off as an obscure investor when we're talking tens of thousands or even millions, but if a mysterious agent starts claiming ownership of a significant percentage of the world GDP, its non-humanity will be discovered and the appropriate authorities will declare its non-physical holdings void, or repossess them, or something else sensible". To be clear, I don't think this is correct, but it is a step you would need to have an answer for.