All of jacob_cannell's Comments + Replies

I'll start with a basic model of intelligence which is hopefully general enough to cover animals, humans, AGI, etc. You have a model-based agent with a predictive world model W learned primarily through self-supervised predictive learning (ie learning to predict the next 'token' for a variety of tokens), a planning/navigation subsystem P which uses W to approximately predict sample important trajectories according to some utility function U, a value function V which computes the immediate net expected discounted future utility of actions from current stat... (read more)

Thanks for the review. I think you've summarized the post fairly well; I'll try to clarify some parts of the underlying model and claims that may have been unclear in the post.

A properly designed simbox will generate clear evidence about whether an AGI will help other agents, or whether it will engage in world conquest.

These two are not mutually exclusive: a fully aligned agent may still 'conquer the world' to prevent unaligned agents from doing the same. The real test is then what the agent does after conquering.

There's a wide variety of threat mod

... (read more)
I was focused solely on testing alignment. I'm pretty confused about how much realism is needed to produce alignment. I guess I should have realized that, but it did not seem obvious enough for me to notice. I'm skeptical. I don't know whether I'll manage to quantify my intuitions well enough to figure out how much we disagree here. It seems likely that some AGIs would notice changes in behavior. I expect it to be hard to predict what they'll infer. But now that I've thought a bit more about this, I don't see any likely path to an AGI finding a way to resist the change in utility function.

I don't think of LLMs like GPT-3 as agents that use language; they are artificial linguistic cortices which can be useful to brains as (external or internal) tools.

I imagine that a more 'true' AGI system will be somewhat brain-like in that it will develop a linguistic cortex purely through embedded active learning in a social environment, but is much more than just that one module - even if that particular module is the key enabler for human-like intelligence as distinct from animal intelligence.

That's certainly true, but it seems like currently an unsol

... (read more)

Multiplayer Minecraft may already be a complex enough environment for AGI, even if it is a 'toy' world in terms of visuals. Regardless, even if AGI requires an environment with more realistic and complex physics, such simulations are not expensive relative to AGI itself. "Lazy rendering" of the kind we'd want to use for more advanced sims does not have any inherent consistency tradeoff beyond those inherent to any practical approximate simulation physics.

Foundation text and vision models will soon begin to transform sim/games but that is mostly a separate ... (read more)

I agree Minecraft is a complex enough environment for AGI in principle. Perhaps rich domain distinction wasn't the right distinction. It's more like whether there are already abstractions adapted to intelligence built into the environment or not, like human language. Game of Life is expressive enough to be an environment for AGI in principle too, but it's not clear how to go about that.

That's certainly true, but it currently seems to be an unsolved problem how to make sim-grown agents that learn a language from scratch. That's my point: brute force search such as evolutionary algorithms would require much more compute.

In my view -- and not everyone agrees with this, but many do -- GPT is the only instance of (proto-) artificial general intelligence we've created. This makes sense because it bootstraps off human intelligence, including the cultural/memetic layer, which was forged by eons of optimization in rich multi-agent environments. Self-supervised learning on human data is the low-hanging fruit. Even more so if the target is not just "smart general optimizer" but something that resembles human intelligence in all the other ways, such as using something recognizable as language and more generally being comprehensible to us at all.

I don't see how training of VPT or EfficientZero was compute inefficient. In fact for self-driving cars the exact opposite is true - training in simulation can be much more efficient than training in reality.

VPT and EfficientZero are trained in toy environments, and self-driving car sims are also low-dimensional hard-coded approximations of the deployment domain (which afaik does cause some problems for edge cases in the real world). The sim for training AGI will probably have to be a rich domain, which is more computationally intensive to simulate and so will probably require lazy rendering like you say in the post, but lazy rendering runs into challenges of world consistency. Right now we can lazily simulate rich domains with GPT, but they're difficult to program reliably and not autonomously stable (though I think they'll become much more autonomously stable soon). And the richness of current GPT simulations inherits from massive human datasets. Human datasets are convenient because you have some guaranteed samples of a rich and coherent world. GPTs bootstrap from the optimization done by evolution and thousands of years of culture compressing world knowledge and cognitive algorithms into an efficient code, language. Skipping this step, it's a lot less clear how you'd train AGI, and it seems to me that, barring some breakthrough on the nature of intelligence or efficient ML, it would have to be more computationally intensive to compensate for the disadvantage of starting tabula rasa.

The verbal monologue is just near the highest level of compression abstraction in a multi-resolution compressed encoding, but we are not limited to only monitoring at the lowest bitrate (highest level of abstraction/compression).

There is already significant economic pressure on DL systems towards being 'verbal' thinkers: nearly all large-scale image models are now image->text and text->image models, and the corresponding world_model->text and text->world_model design is only natural for robotics and AI approaching AGI.

Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?

The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization ("argmax is a trap").

The first one that comes to my mind is: suppose we live in a world where intelligence explosion is possible, and someone builds an AI with a flawed utility function,

If you assume you've already completely failed then the how/why is less i... (read more)

Assuming softmax is important for competitiveness instead, I don't see why this argument doesn't go through with "argmax" replaced by "softmax" throughout (including the "argmax is a trap" section of the OP). I read your linked comment and post, and still don't understand. I wonder what the authors of the OP (or anyone else) think about this.

Yeah, the typical human is only partially aligned with the rest of humanity and only in a highly non-uniform way, so you get the typical distribution of historical results when giving supreme power to a single human - with outcomes highly contingent on the specific human.

So if AGI is only as aligned as typical humans, we'll also probably need a heterogeneous AGI population and robust decentralized control structures to get a good multipolar outcome. But it also seems likely that any path leading to virtual brain-like AGI will also allow for selecting for altruism/alignment well outside the normal range.

Sure - if we got to AGI next year - but for that to actually occur you'd have to exploit most of the remaining optimization slack in both high level ML and low level algorithms. Beyond that, Moore's law has already mostly ended or nearly so, depending on who you ask, and most of the easy obvious hardware arch optimizations are now behind us.

Clearly the former precedes the latter - assuming by 'primary asset' you mean that which we eventually release into the world.

No, because of the generalized version of Amdahl's law, which I explored in "Fast Minds and Slow Computers".

The more you accelerate something, the slower and more limiting all its other hidden dependencies become.

So by the time we get to AGI, regular ML research will have rapidly diminishing returns (and CUDA low level software or hardware optimization will also have diminishing returns), general hardware improvement will be facing the end of Moore's law, etc.

Daniel Kokotajlo (10d):
I don't see why that last sentence follows from the previous sentences. In fact I don't think it does. What if we get to AGI next year? Then returns won't have diminished as much & there'll be lots of overhang to exploit.

Thus, in order to avoid deceptive alignment, we need to modify the training regime in such a way that it somehow actively avoids deceptively aligned models.

I wasn't thinking in these terms yet, but I reached a similar conclusion a while back, and my mainline approach of training in simboxes is largely concerned with how to design test environments where you can train and iterate without the agents knowing they are in a sim training/test environment.

I also somewhat disagree with the core argument in that it proves too much about humans. Humans are app... (read more)

Hmm, humans do appear approximately aligned as long as they don't have decisive advantages. "Power corrupts" and all that. If you take an average "aligned" human and give them unlimited power and no checks and balances, the usual trope happens in real life.

My comment began as a discourse on why practical agents are not really utility argmaxers (due to the optimizer's curse).

You do not need to model human irrationality and it is generally a mistake to do so.

Consider a child who doesn't understand that the fence is to prevent them from falling off stairs. It would be a mistake to optimize for the child's empowerment using their limited irrational world model. It is correct to use the AI's more powerful world model for computing empowerment, which results in putting up the fence (or equivalent) in situations where the AI models that as preventing the child from death or disability.

Likewise for the other scenarios.
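The fence example can be sketched crudely in code. Everything below is hypothetical: the states, the transitions, and the empowerment measure itself (a log-count of reachable states standing in for the usual channel-capacity definition). The point it illustrates is that the same intervention scores oppositely depending on whose world model you compute empowerment with.

```python
import math

def reachable(model, start, horizon):
    """States reachable from `start` within `horizon` steps."""
    frontier, seen = {start}, {start}
    for _ in range(horizon):
        frontier = {s2 for s in frontier for s2 in model[s]}
        seen |= frontier
    return seen

def empowerment(model, state, horizon=2):
    # Crude proxy: log2 of the number of reachable states.
    return math.log2(len(reachable(model, state, horizon)))

def expected_future_empowerment(model, start, steps):
    # Distribution over the child's state after `steps` of random movement,
    # then the empowerment of wherever they ended up.
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for s, p in dist.items():
            for s2 in model[s]:
                nxt[s2] = nxt.get(s2, 0.0) + p / len(model[s])
        dist = nxt
    return sum(p * empowerment(model, s) for s, p in dist.items())

# AI's model without the fence: the stairs lead to a fall -- an absorbing
# state with no remaining options (zero empowerment forever).
no_fence = {"room": ["room", "stairs"], "stairs": ["fallen"], "fallen": ["fallen"]}
# AI's model with the fence: the child just bounces back into the room.
fence = {"room": ["room", "stairs"], "stairs": ["room"]}

print(expected_future_empowerment(no_fence, "room", steps=10))  # near 0
print(expected_future_empowerment(fence, "room", steps=10))     # stays positive
```

Under the child's limited model (no "fallen" state), removing the fence would score higher; under the AI's more accurate model, the fence wins, which is the point of the comment above.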

Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn't argmax (or do its best to approximate argmax),

Actual useful AGI will not be built from argmax, because it's not really useful for efficient approximate planning. You have exponential (in time) uncertainty from computational approximation and fundamental physics. This results in uncertainty over future state value estimates, and if you try to argmax with that uncertainty you are just selecting for noise. The correct solutions for handling uncert... (read more)
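The "selecting for noise" point can be shown with a toy simulation (all numbers made up): among many candidate plans with identical true value, argmax over noisy value estimates reliably "finds" value that is pure estimation noise.

```python
import random

random.seed(0)

# 1000 candidate plans, all with true value 0; the planner only sees
# noisy value estimates (noise sd = 1). Numbers are illustrative.
n_plans, noise_sd, trials = 1000, 1.0, 200

gap = 0.0
for _ in range(trials):
    estimates = [random.gauss(0.0, noise_sd) for _ in range(n_plans)]
    # argmax selects whichever plan got the luckiest noise draw, so its
    # estimate overstates the true value (0) by the max of the noise.
    gap += max(estimates)

avg = gap / trials
print(avg)  # roughly 3 sd of pure noise, mistaken for value
```

The selected plan's estimate exceeds its true value by about three standard deviations of the noise, which is exactly the adversarial selection effect the optimizer's curse describes.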

Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax? The first one that comes to my mind is: suppose we live in a world where intelligence explosion is possible, and someone builds an AI with a flawed utility function; it would quickly become superintelligent and ignore orders to shut down because shutting down has lower expected utility than not shutting down. It seems to me that replacing the argmax in the AI's decision procedure with softmax results in the same outcome, since the AI's estimated expected utility of not shutting down would be vastly greater than shutting down, resulting in a softmax of near 1 for that option. Am I misunderstanding something in the paragraph above, or do you have other arguments in mind?

It's pretty easy to make your references actual footnotes with google scholar links using either the markdown or html editor, and this really helps readers follow your references vs just mentioning them vaguely like "Deci and Ryan (1985)".

Of course, due to evolution, it seems like in theory there should be "efficient markets" in producing more copies of oneself, so maybe there is a blocker there. But as I said, it seems like that blocker doesn't really hold, because we just had a pandemic.

Yeah basically this - there already is an efficient market for nanotech replicators. The most recent pandemic was only a minor blip in the grand scheme of things, it would take far more to kill humanity or seriously derail progress, and unaligned AGI would not want to do that anyway vs just soft covert takeover.

Mirror cells and novel viruses are well within 'boring' advanced biotech, which can be quite dangerous. My argument of implausibility was directed at sci-fi hard nanotech, like grey goo.

If I had to guess at why these counterarguments fall apart, then it's that unaligned AGI wouldn't design a pandemic by mistake, because a germ capable of causing pandemics would have to specifically be designed for targeting human biology?

That seems plausible. The risk is that an unaligned AGI could kill or weaken humanity through advanced biotech. I don't think thi... (read more)

Any efficient model-based agent will use learned value functions, so in practice the difference between model-based and model-free blurs for efficient designs. The model-based planning generates rollouts that can help better train the 'model free' value function.

EfficientZero uses all that, and like I said - it does not exhibit this failure mode, it will get the blueberry. If the model planning can predict a high gradient update for the blueberry then it already has implicitly predicted a high utility for the blueberry, and EZ's update step would then co... (read more)
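The mechanism described here - model-based rollouts producing targets that train the "model free" value function - can be sketched with a made-up three-state toy problem. This is an illustration of the general idea, not EfficientZero's actual implementation:

```python
# Toy deterministic world model: start -> blueberry -> end (all hypothetical).
transitions = {"start": "blueberry", "blueberry": "end", "end": "end"}
rewards = {"start": 0.0, "blueberry": 10.0, "end": 0.0}
gamma = 0.9

value = {s: 0.0 for s in transitions}  # the learned "model-free" value function
lr = 0.5

def rollout_target(state, depth):
    """n-step return computed by unrolling the world model."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        ret += discount * rewards[state]
        discount *= gamma
        state = transitions[state]
    return ret + discount * value[state]

# Planning rollouts repeatedly generate targets that update the value function.
for _ in range(50):
    for s in transitions:
        value[s] += lr * (rollout_target(s, depth=3) - value[s])

print(round(value["start"], 2))  # → 9.0: the blueberry's value, propagated back
```

Once the model predicts high reward for the blueberry, the rollout targets pull the value function toward it, which is why the planning/value distinction blurs for efficient designs.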

And the weights are getting streamed from external RAM; GPUs can't stream a matrix multiplication efficiently, as far as I'm aware.

Of course GPUs can and do stream a larger matrix multiplication from RAM - the difference is that the GPU design has multiple OOM more bandwidth to the equivalent external RAM (about 3 OOM to be more specific). Also the latest Lovelace/Hopper GPUs have more SRAM now - 50MB per chip, so about 10GB of SRAM for a 200 GPU pod similar to the Cerebras wafer.
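Spelling out the pod SRAM arithmetic from the per-chip figure quoted above (50 MB per chip, a 200 GPU pod):

```python
sram_per_chip_mb = 50   # per-chip on-die SRAM figure from the comment
chips_per_pod = 200     # pod size used in the comparison

pod_sram_gb = sram_per_chip_mb * chips_per_pod / 1000
print(pod_sram_gb)  # → 10.0 GB of aggregate on-die SRAM for the pod
```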

The CS-2 is only good at matrix-vector operations that fit in its SRAM cap... (read more)

Amal (12d):
My understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to store large amounts of data for model partitions, from which just a small portion is used for the computation at any single time point, CS-2 only requests what it needs, so the bandwidth doesn't need to be so big.

The main advantage they claim to have is "storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model."

This is almost a joke, because the equivalent GPU architecture has both greater total IO bandwidth to any external SSD/RAM array, and the massive near-die GPU RAM that can function as a cache for any streaming approa... (read more)

Zach Furman (13d):
Hmm, I'm still not sure I buy this, after spending some more time thinking about it. GPUs can't stream a matrix multiplication efficiently, as far as I'm aware. My understanding is that they're not very good at matrix-vector operations compared to matrix-matrix because they rely on blocked matrix multiplies to efficiently use caches and avoid pulling weights from RAM every time. Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and uses dataflow scheduling, so it can stream a matrix multiplication by just performing matrix-vector operations as weights stream in. And the weights are getting streamed from external RAM, rather than requested as needed, so there's no round-trip latency gunking up the works like a GPU has when it wants data from RAM. Cerebras claims that their hardware support for fast matrix-vector multiplication gives a 10x speed boost to multiplying sparse matrices, which could be helpful.

I think using external empowerment as a simpler bound on the long term component of humanity's utility function is one of the more exciting developments in AI alignment. Your article has a number of interesting links to prior work that I completely missed.

Part of my reason for optimism is I have a model of how empowerment naturally emerges in efficient long term planning from compounding model uncertainty, and there is already pretty strong indirect evidence the brain uses empowerment. So that allows us to avoid the issue of modeling human utility funct... (read more)

not sure what you mean here. As a heuristic present in biases? or in split second reactions?

Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn't navigate to that future.

If this agent is s... (read more)

Long before they knew about reward circuitry, humans noticed that e.g. vices are behavioral attractors, with vice -> more propensity to do the vice next time -> vice, in a vicious cycle. They noticed that far before they noticed that they had reward circuitry causing the internal reinforcement events. If you're predicting future observations via eg SSL, I think it becomes important to (at least crudely) model effects of value drift during training. I'm not saying the AI won't care about reward at all. I think it'll be a secondary value, but that was sideways of my point here. In this quote, I was arguing that the AI would be quite able to avoid a "vice" (the blueberry) by modeling the value drift on some level. I was showing a sufficient condition for the "global maximum" picture getting a wrench thrown in it. When, quantitatively, should that happen, where the agent steps around the planning process? Not sure.
I think I have some idea what TurnTrout might've had in mind here. Like us, this reflective agent can predict the future effects of its actions using its predictive model, but its behavior is still steered by a learned value function, and that value function will by default be misaligned with the reward calculator/reward predictor. This—a learned value function—is a sensible design for a model-based agent because we want the agent to make foresighted decisions that generalize to conditions we couldn't have known to code into the reward calculator (i.e. searching in a part of the chess move tree that "looks promising" according to its value function, even if its model does not predict that a checkmate reward is close at hand).

But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations make you resort to tensor or pipeline parallelism instead of data parallelism.

Well that's not quite right - otherwise everyone would be training on single GPUs using very different techniques, which is not what we observe. Every parallel system has communication, but it doesn't necessarily 'spend time' on that in the blocking sense, it typically happens in para... (read more)
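A toy latency model makes the point about communication happening in parallel concrete (the numbers below are invented): overlapped communication is hidden behind compute rather than added to it.

```python
# Per-step time for data-parallel training, under two assumptions about
# how gradient communication is scheduled. All figures are made up.
compute_ms = 100.0   # forward + backward pass per step (assumed)
comm_ms = 60.0       # gradient all-reduce per step (assumed)

blocking = compute_ms + comm_ms        # comm "spends time" only if serialized
overlapped = max(compute_ms, comm_ms)  # overlapped comm hides behind backprop

print(blocking, overlapped)  # → 160.0 100.0
```

So as long as per-step communication fits under the compute time, adding GPUs doesn't cost wall-clock time on communication; it only becomes the bottleneck once comm exceeds compute.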

Amal (13d):
I am certainly not an expert, but I am still not sure about your claim that it's only good for running small models. The main advantage they claim to have is "storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model." So they explicitly claim that it should perform well with large models. Furthermore, in their white paper they claim that the CS-2 architecture is much better suited for sparse models (e.g. by the Lottery Ticket Hypothesis), and on page 16 they show that a sparse GPT-3 could be trained in 2-5 days. This would also align with tweets by OpenAI that trillion is the new billion, and rumors about the new GPT-4 being a similarly big jump as GPT-2 -> GPT-3 was - having a colossal number of parameters and a sparse paradigm. I could imagine that sparse parameters deliver much stronger results than normal parameters, and this might change scaling laws a bit.

"Corrigibility" means making the AGI care about human values through the intermediary of humans — making it terminally care about "what agents with the designation 'human' care about". (Or maybe "what my creator cares about",

Interesting - that is actually what I've considered to be proper 'value learning': correctly locating and pointing to humans and their values in the agent's learned world model, in a way that naturally survives/updates correctly with world model ontology updates. The agent then has a natural intrinsic motivation to further improve ... (read more)

Thane Ruthenis (13d):
I wasn't making a definitive statement on what I think people mean when they say "corrigibility", to be clear. The point I was making is that any implementation of corrigibility that I think is worth trying for necessarily has the "faithfulness" component — i.e., the AI would have to interpret its values/tasks/orders the way they were intended by the order-giver, instead of some other way. Which, in turn, likely requires somehow making it locate humans in its world-model (though likely implemented as "locate the model of [whoever is giving me the order]" in the AI's utility function, not necessarily referring to [humans] specifically). And building off that definition, if "value learning" is supposed to mean something different, then I'd define it as pointing at human values not through humans, but directly. I.e., making the AI value the same things that humans value not because it knows that it's what humans value, but just because. Again, I don't necessarily think that it's what most people mean by these terms most times — I would natively view both approaches to this as something like "value learning" as well. But this discussion started from John (1) differentiating between them, and (2) viewing both approaches as viable. This is just how I'd carve it under these two constraints.

Hard nanotech (the kind usually envisioned in sci-fi) may be physically impossible, and at the very least is extremely difficult. The type of nanotech that is more feasible is 1.) top-down lithography (ie chips), and 2.) bottom up cellular biology, or some combinations thereof.

Biological cells are already near optimal nanotech robots in both practical storage density and computational energy efficiency (landauer limit). Even a superintelligence, no matter how clever, will not be able to design nanobots that are vastly more generally capable than biologic... (read more)

It would make evolutionary sense for current cells to be near optimal, and therefore for there to not be much opportunity for biotech/nanotech to do big powerful stuff. However, I notice that this leaves me confused about two things. First, common rhetoric in the rationalist community is that this is a big risk. E.g. Robin Hanson advocated banning mirror cells, and I regularly hear people suggest working on preventing pandemics as an x-risk, or talk about how gain-of-function research is dangerous. Secondly, there's the personal experience that we just had this huge pandemic thing. If existing biology exploits opportunities to their limit with there being no space for novel mechanisms to compete, then it seems like we shouldn't have had a pandemic. If I had to guess at why these counterarguments fall apart, then it's that unaligned AGI wouldn't design a pandemic by mistake, because a germ capable of causing pandemics would have to specifically be designed for targeting human biology?

You are correct. I should reword that a bit. I do think the brain-like compute basin is wide and likely convergent, but it's harder to show that it's necessarily a single unique basin; there could be other disjoint regions of high efficiency in architecture space.

This doesn't seem impressive compared to Nvidia's offerings.

The Andromeda 'supercomputer' has peak performance of 120 pflops dense compared to 512 pflops dense for a single 256 H100 GPU pod from Nvidia and is unlikely to be competitive in compute/$; if it were competitive, Cerebras would be advertising/boasting that miracle as loudly as they could. Instead they are focusing on this linear scaling thing, which isn't an external performance comparison at all.

The Cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of train... (read more)

Zach Furman (14d):
I'm not sure if PFLOPs are a fair comparison here though, if I understand Cerebras' point correctly. Like, if you have ten GPUs with one PFLOP each, that's technically the same number of PFLOPs as a single GPU with ten PFLOPs. But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations make you resort to tensor or pipeline parallelism instead of data parallelism. Cerebras claims that to train "10 times faster you need 50 times as many GPUs." According to this logic what you really care about instead is probably training speed or training speedup per dollar. Then the pitch for Andromeda, unlike a GPU pod, is that those 120 PFLOPS are "real" in the sense that training speed increases linearly with the PFLOPS. I'm not sure I totally have a good grasp on this, but isn't this the whole point of Andromeda's weight streaming system? Fast off-chip memory combined with high memory bandwidth on the chip itself? Not sure what would limit this to small models if weights can be streamed efficiently, as Cerebras claims. Even if I'm right, I'm not sure either of these points change the overall conclusion though. I'd guess Cerebras still isn't economically competitive or they'd be boasting it as you said.

Have you ever driven subconsciously on system 1 autopilot while your conscious system 2 is thinking about something else? But then if there is any hint of danger or something out of the ordinary, that shifts your complete attention to driving. Imagine how unsafe/unrobust human drivers would be without the occasional full conscious system 2 focus.

The analogy isn't perfect, but current DL systems are more like system 1 which doesn't scale well to the edge case scenarios. There are some rare situations that require complex thought chain reasoning ... (read more)

To be clear I also am concerned, but at lower probability levels and mostly not about doom. The laughable part is the specific "our light cone is about to get ripped to shreds" by a paperclipper or the equivalent, because of an overconfident and mostly incorrect EY/LW/MIRI argument involving supposed complexity of value, failure of alignment approaches, fast takeoff, sharp left turn, etc.

I of course agree with Aaro Salosensaari that many of the concerned experts were/are downstream of LW. But this also works the other way to some degree: beliefs about AI ... (read more)

I think you're reading way too much into the specific questionable wording of "tragically flawed". By that I meant that they are flawed in some of the key background assumptions, how that influences thinking on AI risk/alignment, and the consequent system wide effects. I didn't mean they are flawed at their surface level purpose - as rationalist self help and community foundations. They are very well written and concentrate a large amount of modern wisdom. But that of course isn't the full reason for why EY wrote them: they are part of a training funnel to produce alignment researchers.

That's not really what I'm saying: it's more like this community naturally creates nearby phyg-like attractors which take some individually varying effort to avoid. If you don't have any significant differences of opinion/viewpoint you may already be in the danger zone. There are numerous historical case examples of individuals spiraling too far in, if you know where to look.


If you want to talk about cults, just say "cult".

And your reply isn't a concrete reply to any of my points.

The quotes from LOGI clearly establish exactly where EY agrees with T&C, and the other quotes establish the relevance of that to the sequences. It's not like two separate brains wrote LOGI vs the sequences, and the other quotes establish the correspondence regardless.

This is not a law case where I'm critiquing some super specific thing EY said. Instead I'm tracing memetic influences: establishing what high level abstract brain/AI viewpoint cluster he was roughly in when he wrote the sequences, and how that influenced them. The quotes are clear enough for that.

So it turns out I'm just too stupid for this high level critique. I'm only used to ones where you directly reference the content of the thing you're asserting is tragically flawed. In order to get across to less sophisticated people like me in the future, my advice is to independently figure out how this "memetic influence" got into the sequences and then just directly refute whatever content it tainted. Otherwise us brainlets won't be able to figure out which part of the sequences to label "not true" due to memetic influence, and won't know if your disagreements are real or made up for contrarianism's sake.

you need to actually point to a specific portion of the sequences that rests on these foundations you speak of,

I did.

My comment has 8 links. The first is a link to "Adaptation-Executers, not Fitness-Maximizers", which is from the sequences ("The Simple Math of Evolution"), and it opens with a quote from Tooby and Cosmides. The second is a link to the comment section from another post in that sequence (Evolutions Are Stupid (But Work Anyway)) where EY explicitly discusses T&C, saying:

They're certainly righter than the Standard Social Sciences Mod

... (read more)
None of which is a concrete reply to anything Eliezer said inside "Adaptation-Executers, not Fitness-Maximizers", just a reply to what you extrapolate Eliezer's opinion to be, because he read Tooby and Cosmides and claimed they were somewhere in the ballpark in a separate comment. So I ask again: what portion of the sequences do you have an actual problem with?

Humans continuously pick their own training data and generally aren't especially aware of the implicit bias this causes and consequent attractor dynamics. This could be the only bias that really matters strongly, and ironically it is not especially recognized in the one community supposedly especially concerned about cognitive biases.

Our community seems to love treating people like mass-produced automatons with a fixed and easily assessable "ability" attribute.

Have you considered the implied contradiction between the "culturally broken community" you describe and the beliefs - derived from that same community - which you espouse below?

I was crying the other night because our light cone is about to get ripped to shreds. I'm gonna do everything I can to do battle against the forces that threaten to destroy us.

Your doom beliefs derive from this "culturally broken community" - you p... (read more)

experts knowledgeable in the relevant subject matters that would actually lead to doom find this laughable

This seems overstated; plenty of AI/ML experts are concerned. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Quoting from [1], a survey of researchers who published at top ML conferences:

The median respondent’s probability of x-risk from humans failing to control AI was 10%

Admittedly, that's a far cry from "the light cone is about to get ripped to shreds," but it's also pretty far from finding those concerns laughable. [Edited to add: another recent survey ... (read more)

sudo -i (16d):
I really don't want to entertain this "you're in a cult" stuff. It's not very relevant to the post, and it's not very intellectually engaging either. I've dedicated enough cycles to this stuff.

Just noting some comment links for future reference:

The sequences are tragically flawed - based on some overconfident assumptions about the brain and AI which turned out to be incorrect.

CEV is probably against suicide, and so is human empowerment.

Evolution succeeded at alignment: humans are massively successful by IGF (inclusive genetic fitness) metrics, and some do actually optimize mentally for IGF.

Some quick comments based purely on the poster (which is probably the most important part of your funnel):

"Biological Anchors" is probably not a meaningful term for your audience.

We have a 50% chance of recreating that amount of relevant computation by 2060

This seems wrong in that we already have roughly brain-scale levels of relevant computation now, or will very soon - far before 2060. The remaining uncertainty is over software/algorithms, not hardware: we already have the hardware, or nearly do.
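As a back-of-envelope sketch of that claim (all figures below are my own rough order-of-magnitude assumptions for illustration, not numbers from the post or the poster):

```python
# Rough back-of-envelope: lifetime "brain compute" vs. a modern GPU cluster.
# Every figure here is a loose order-of-magnitude assumption, not a measurement.

brain_synapses = 1e14        # assumed synapse count
avg_spike_rate = 1.0         # assumed average firing rate, Hz
brain_ops_per_s = brain_synapses * avg_spike_rate    # ~1e14 synaptic ops/s

seconds_per_year = 3.15e7
lifetime_s = 30 * seconds_per_year                   # ~1e9 s of "training"
brain_lifetime_ops = brain_ops_per_s * lifetime_s    # ~1e23 ops

gpu_flops = 1e15             # assumed throughput of one modern accelerator
cluster_gpus = 1e4           # assumed cluster size
cluster_flops = gpu_flops * cluster_gpus             # 1e19 FLOP/s

days_to_match = brain_lifetime_ops / cluster_flops / 86400
print(f"brain lifetime ops ~{brain_lifetime_ops:.1e}")
print(f"cluster days to match: {days_to_match:.2f}")
```

Under these (debatable) assumptions, a present-day large cluster matches a human lifetime of "brain compute" in well under a day, which is why the binding constraint looks algorithmic rather than hardware-bound.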

Once AI is capable of ML programming, it could improve its alg

... (read more)
Neil Crawford (16d):
Thanks, Jacob! This is helpful. I've made the relevant changes to my copy of the poster. Regarding the 'biological anchors' point, I intended to capture the notion that it is not just the level/amount of computation that matters by prefixing with the word 'relevant'. When expanding on that point in conversation, I am careful to point out that generating high levels of computation isn't sufficient for creating human-level intelligence. I agree with what you say. I also think you're right about the term "biological anchors" not being very meaningful to my audience. Given that, from my experience, many academics see the poster but don't ask questions, it's probably a good idea for me to substitute this term for another. Thanks!

Thanks, I partially agree so I'm going to start with the most probable crux:

  1. Empowerment can be in conflict with human utility/desires, best illustrated by the suicide example. Therefore, I think human empowerment could be helpful for alignment, but am very skeptical it is almost all you need.

I am somewhat confident that any fully successful alignment technique (one resulting in a fully aligned CEV style sovereign) will prevent suicide; that this is a necessarily convergent result; and that the fact that maximizing human empowerment agrees with the id... (read more)

so, my claim we should find the fact that the problem and the solution has a lot of research in common (studying powerful AI) is a weird interesting fact, and we should generally assume that the research that is involved in the problem won't particularly be helpful to research that is involved in the solution — for example, if FAS is made in a way that is pretty different from current AI or XAI.

This essay seemed straightforward until this last paragraph, because: the "problem and solution has a lot of research in common" seems to directly contradict "res... (read more)

(i kinda made a mess out of that last paragraph; i've edited in a much more readable version of it) okay, yeah i didn't explain that super well. what i was trying to say was: we have found them to have a lot in common so far, in retrospect, but we shouldn't have expected that in advance and we shouldn't necessarily expect that in the future. this isn't to say that they'll have nothing in common, hopefully we can reuse current ML tech for at least parts of FAS. but i think that that low-probability miracle is still our best bet, and a bottleneck to saving the world — i think other solutions either need to get there too eventually, or are even harder to turn into FAS. (i say that with not super strong confidence, however)

So now after looking into the "last man" and "men without chests" concepts, I think the relevant quote from "men without chests" is at the end:

The Apostle Paul writes, “The aim of our charge is love that issues from a pure heart and a good conscience and a sincere faith (1 Timothy 1:5, ESV).” If followers of Christ live as people with chests—strong hearts filled with God’s truth—the world will take notice.

"Men without chests" are then pure selfish rational agents devoid of altruism/love. I agree that naively maximizing the empowerment of a single huma... (read more)

The problem is not that we don't know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence.

Yes, this is still underappreciated in most alignment discourse, perhaps because power-seeking has unfortunate negative connotations. A better less loaded term might be Optionality-seeking. For example human friendships increase long term optionality (more social invites, social support, dating and business opportunities, etc), so a human trading some wealth for activities that increase and strength... (read more)

I think some people use the term "power-seeking" to refer specifically to the negative connotations of the term (hacking into a data center, developing harmful bioweapons and deploying them to retain control, etc).
I totally agree that the choice of "power seeking" is very unfortunate because of the same reasons you describe. I don't think optionality is quite it, though. I think "consequentialist" or "goal seeking" might be better (or we could just stick with "instrumental convergence"--it at least has neutral affect).

As for underappreciatedness, I think this is possibly true, though anecdotally at least for me I already strongly believed this and in fact a large part of my generator of why I think alignment is difficult is based on this. I think I disagree about leveraging this for alignment but I'll read your proposal in more detail before commenting on that further.

Yeah I'm definitely not sure, but I'm also doubting much anything I read on twitter will make me as sure about any of this as many others seem to be (although that thread is interesting). I have no stake in this saga; for all I know SBF is guilty of all those claims and more, but I just don't think the public info is very reliable in general and especially right at this moment.

What about this WSJ story?

Also, I suggest applying some Bayesianism: if FTX had not touched customer funds, wouldn't you expect to see FTX people defending against such accusations on Twitter and other media, e.g., replying to the tweet I linked and responding to journalists' requests for comments? I'm not seeing anything like that.

However, I think it is starting to look increasingly likely that, even if FTX's handling of its customer's money was not technically legally fraudulent, it seems likely to have been fraudulent in spirit.

Are you basing this accusation of fraud on that tweet (and related) from CZ?

If this was a mystery and I was the investigator, I'd focus on "follow the money" and ask "who has the most to gain?", and then notice that:

  1. FTX was pursuing a risky leveraged growth strategy and gaining on their main competitor: Binance/CZ
  2. FTX then became suddenly vulnerable du
... (read more)

Check out this Twitter thread if you're not sure that FTX did something seriously illegal and/or unethical.

ETA: It was written by “former Head of Institutional Sales at @ftx_official” who still had access to FTX internal Slack until very recently. And you can check out his followers to verify that it’s not an impersonation.

The biggest problem I see with human empowerment is that humans do not always want to be maximally empowered at every point in time.

Yes - often we face decisions between short term hedonic rewards vs long term empowerment (spending $100 on a nice meal, or your examples of submarine trips), and an agent optimizing purely for our empowerment would always choose long term empowerment over any short term gain (which can be thought of as 'spending' empowerment). This was discussed in some other comments and I think mentioned somewhere in the article but should... (read more)
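To make the "spending empowerment" framing concrete, here is a toy sketch of one simple formalization - empowerment as the log-count of distinct states reachable within a horizon - on a deterministic gridworld. This is my own illustrative construction; the post doesn't commit to this exact formalization, and richer formulations use channel capacity over stochastic dynamics:

```python
# Toy empowerment sketch: log2(# distinct states reachable in `horizon` steps)
# on a small deterministic gridworld. Illustrative only.
import itertools
import math

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
SIZE = 5  # 5x5 grid with clipping at the walls

def step(state, action):
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))

def empowerment(state, horizon):
    """log2 of the number of distinct states reachable in `horizon` steps."""
    reachable = set()
    for plan in itertools.product(ACTIONS, repeat=horizon):
        s = state
        for a in plan:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# The center of the grid is more empowered than a corner:
print(empowerment((2, 2), 2), empowerment((0, 0), 2))  # center ~3.17 > corner ~2.58
```

Note that the horizon is itself a design choice here: short horizons are myopic, long horizons push toward hoarding options, which is the time-cutoff issue raised elsewhere in this thread.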

Stephen Zhao (17d):
Thanks for your response - good points and food for thought there.

One of my points is that this is a problem which arises depending on your formulation of empowerment, and so you have to be very careful with the way in which you mathematically formulate and implement empowerment. If you use a naive implementation I think it is very likely that you get undesirable behaviour (and that's why I linked the AvE paper as an example of what can happen).

Also related is that it's tricky to define what the "reasonable future time cutoff" is. I don't think this is trivial to solve - use too short of a cutoff, and your empowerment is too myopic. Use too long of a cut-off, and your model stops you from ever spending your money, and always gets you to hoard more money. If you use a hard coded x amount of time, then you have edge cases around your cut-off time. You might need a dynamic time cutoff then, and I don't think that's trivial to implement. I also disagree with the characterization of the issue in the AvE paper just being a hyperparameter issue.

Correct me if I am wrong here (as I may have misrepresented/misinterpreted the general gist of ideas and comments on this front) - I believe a key idea around human empowerment is that we can focus on maximally empowering humans - almost like human empowerment is a "safe" target for optimization in some sense. I disagree with this idea, precisely because examples like in AvE show that too much human empowerment can be bad. The critical point I wanted to get across here is that human empowerment is not a safe target for optimization.

Also, the other key point related to the examples like the submarine, protest, and suicide is that empowerment can sometimes be in conflict with our reward/utility/desires. The suicide example is the best illustrator of this (and it seems not too far-fetched to imagine someone who wants to suicide, but can't, and then feels increasingly worse - which seems like quite a nightmare scenario to me). A

EY 2007/2008 was mostly wrong about the brain, AI, and thus alignment in many ways.

As an example, the EY/MIRI/LW conception of AI Boxing assumes you are boxing an AI that already knows 1.) you exist, and 2.) that it is in a box. These assumptions serve pedagogical purpose for a blogger - especially one attempting to impress people with boxing experiments - but they are hardly justifiable, and if you remove those arbitrary constraints it's obvious that perfect containment is possible in simulation sandboxes given appropriate knowledge/training constraints:... (read more)

Sure, but at that point you have substituted trust in the code representing the idea of diamonds for trust in a SI aligned to give you the correct code.

Yeah. Maybe a more central thing to how our views are differing, is that I don't view training signals as identical to utility functions. They're obviously somehow related, but they have different roles in systems. So to me changing the training signal obviously will affect the trained system's goals in some way, but it won't be identical to the operation of writing some objective to an agent's utility function, and the non-identicality will become very relevant for a very intelligent system. Another thing to say, if you like the outer / inner alignment distinction:

  1. Yes, if you have an agent that's competent to predict some feature X of the world "sufficiently well", and you're able to extract the agent's prediction, then you've made a lot of progress towards outer alignment for X; but
  2. unfortunately your predictor agent is probably dangerous, if it's able to predict X even when asking about what happens when very intelligent systems are acting, and
  3. there's still the problem of inner alignment (and in particular we haven't clarified utility functions -- the way in which the trained system chooses its thinking and its actions to be useful to achieve its goal -- which we wouldn't need if we had the predictor-agent, but that agent is unsafe).

So to be clear there is just one AI, built out of several components: a world model, a planning engine, and a utility function. The world model is learned, but assumed to be learned perfectly (resulting in a functional equivalent of the actual sim physics). The planning engine also can learn action/value estimators for efficiency, but that is not required. The utility function is not learned at all, and is manually coded. So the learning components here cannot possibly cause any problems.
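A minimal sketch of that decomposition, under the stated assumptions - exact world model, brute-force planner, hand-coded utility. All names and the toy environment here are my own illustrations, not from the post:

```python
# Sketch of the agent decomposition described above: a world model W (here,
# the exact sim dynamics), a planner P (brute-force search over action
# sequences), and a hand-coded utility function U. Illustrative only.
import itertools

def world_model(state, action):
    """W: exact dynamics of a 1-D world; the agent moves left/right."""
    pos, diamonds = state
    pos = max(0, min(4, pos + action))
    return (pos, diamonds + (1 if pos == 4 else 0))  # a diamond at cell 4

def utility(state):
    """U: hand-coded, not learned -- just counts diamonds collected."""
    return state[1]

def rollout(state, actions):
    """Apply a sequence of actions through the world model."""
    for a in actions:
        state = world_model(state, a)
    return state

def plan(state, horizon):
    """P: exhaustively search action sequences, pick the best under U."""
    return max(itertools.product([-1, +1], repeat=horizon),
               key=lambda acts: utility(rollout(state, acts)))

print(plan((0, 0), 4))  # -> (1, 1, 1, 1): moves right toward the diamond cell
```

In this toy version nothing is learned at all; in the setup the post describes, the world model and action/value estimators would be learned, but the utility function would remain hand-coded exactly as here.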

Of course that's just in a sim.

Translating the concept to the re... (read more)

I think that framing is rather strange, because in the minecraft example the superintelligent diamond tool maximizer doesn't need to understand code or human language. It simply searches for plans that maximize diamond tools.

But assuming you could ask that question through a suitable interface the SI understood - and given some reasons to trust that giving the correct answers is instrumentally rational for the SI - then yes I agree that should work.

Ok. So yeah, I agree that in the hypothetical, actually being able to ask that question to the SI is the hard part (as opposed, for example, to it being hard for the SI to answer accurately). My framing is definitely different than yours. The statement, as I framed it, could be interesting, but it doesn't seem to me to answer the question about utility functions. It doesn't explain how the code that's found, actually encodes the idea of diamonds and does its thinking in a way that's really, thoroughly aimed at making there be diamonds. It does that somehow, and the superintelligence knows how it does that. Be we don't, so we, unlike the superintelligence, can't use that analysis to be justifiedly confident that the code will actually lead to diamonds. (We can be justifiedly confident of that by some other route, e.g. because we asked the SI.)

We are now far from your original objection " I don't even know how to make an agent with a clear utility function module".

Imperfect simulations work just fine - for humans and various DL agents, so for your argument to be correct, you now need to explain how humans can still think and steer the future with imperfect world models, and once you do that you will understand how AI can as well.

We're not far from there. There's inferential distance here. Translating my original statement, I'd say: the closest thing to the "utility function module" in the scenario you're describing here with MuZero, is the concept of predicted diamond and the AI it's inside of. But then you train another AI to pursue that. And I'm saying, I don't trust that that new trained AI actually maximizes diamond; and to the point, I don't have any clarity on how the goals of newly trained AI sit inside it, operate inside it, direct its behavior, etc. And in particular I don't understand it well enough to have any justified confidence it'll robustly pursue diamond.

The agent I described has the perfect model of its environment, and in the limit of compute can construct perfect plans to optimize for diamond tool maximization. So obviously it is the sort of agent that is competent enough to transform its world - there is no other agent more competent.

Learning a new domain (like a different sim environment) would require repeating all the steps.

which implies that the concept of predicted diamond will have trouble understanding what these new capabilities mean

The concept of predicted diamond doesn't understand anyt... (read more)

Would your point here w.r.t. utility functions be fairly summarizable as the following? I would agree with that statement.
In the real world, these domains aren't the sort of thing where you get a perfect simulation. The differences will strongly add up when you strongly train an AI to maximize <this thing which was a good predictor of diamonds in the more restricted domain of <the domain, as viewed by the AI that was trained to predict the environment> >.