All of Jacob Pfau's Comments + Replies

Not sure; I just pasted it. Maybe it's the referral link vs default url? Could also be a markdown vs docs editor difference.

Maybe I should've emphasized this more, but I think the relevant part of my post to think about is when I say

Absent further information about the next token, minimizing an imitation learning loss entails outputting a high entropy distribution, which covers a wide range of possible words. To output a ~0.5 probability on two distinct tokens, the model must deviate from this behavior by considering situational information.

Another way of putting this is that to achieve low loss, an LM must learn to output high-entropy distributions in cases of uncertainty. Separately, L…
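A toy numeric illustration of that first sentence (my own sketch, not from the post): when two next tokens are genuinely equally likely given the context, expected cross-entropy (imitation) loss is minimized by the matched, high-entropy prediction, and a confident low-entropy prediction does strictly worse.

```python
import math

def expected_ce(p_true, q_pred):
    # expected cross-entropy of prediction q when the true next-token
    # distribution given the context is p
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

# two tokens are genuinely ~0.5/0.5 likely given the context
p_true = [0.5, 0.5]
loss_matched = expected_ce(p_true, [0.5, 0.5])    # high-entropy output
loss_confident = expected_ce(p_true, [0.9, 0.1])  # low-entropy output
assert loss_matched < loss_confident  # matching the uncertainty is optimal
```

So putting ~0.5 on two distinct tokens is exactly what loss minimization demands here; concentrating mass on one token is only optimal when the model has extra (e.g. situational) information.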

What we care about is whether compute being done by the model faithfully factors through token outputs. To the extent that a given token under the usual human reading doesn't represent much compute, it doesn't matter whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually-uses-CoT) model should error-correct mistakes as some monotonic function of how compute-expensive correction is.

For copying-errors, the copying operation…

Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.

It's ha…


I don't think the shift-enter thing worked. Afterwards I tried breaking up lines with special symbols, IIRC. I agree that this capability eval was imperfect. The more interesting thing to me was Bing's suspicion in response to a neutrally phrased correction.

I agree that there's an important use-mention distinction to be made with regard to Bing misalignment. But, I think this distinction may or may not be most of what is going on with Bing -- depending on facts about Bing's training.

Modally, I suspect Bing AI is misaligned in the sense that it's incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection, and fake corrections), and mis-generalizes to resist benign, well-intentioned corrections e.g. my example here

-> use-mention is n…

Bing becomes defensive and suspicious on a completely innocuous attempt to ask it about ASCII art. I've only had 4ish interactions with Bing, and stumbled upon this behavior without making any attempt to find its misalignment.

It looks like you didn't (and maybe can't) enter the ASCII art in the form Bing needs to "decode" it? For one, I'd expect line breaks, both after and before the code block tags and also between each 'line' of the art. If you can, try entering new lines with <kbd>Shift</kbd>+<kbd>Enter</kbd>. That should allow new lines without being interpreted as 'send message'.
Sean Kane (4mo):
What was it supposed to say?

The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. Think of text sequences as the second part of a sequence whose first part is composed of whatever non-text world events preceded the text (or even more text data that was dropped from the context). I'd guess a formalization of this would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously e.g. here.
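For reference, the strict-stationarity condition at issue (standard definition, my notation): a process $(X_t)$ is stationary if its joint distributions are shift-invariant,

```latex
P\left(X_{t_1} \in A_1, \ldots, X_{t_n} \in A_n\right)
  = P\left(X_{t_1+k} \in A_1, \ldots, X_{t_n+k} \in A_n\right)
  \quad \text{for all } n,\, k,\, t_1, \ldots, t_n.
```

Prepending a differently-distributed block (world events, dropped context) to the text sequence is exactly the kind of structure that breaks this shift-invariance.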

This is, of course…

Agree on points 3,4. Disagree on point 1. Unsure of point 2.

On the final two points, I think those capabilities are already in place in GPT3.5. Any capability/processing which seems necessary for general instruction following I'd expect to be in place by default. E.g. consider what processing is necessary for GPT3.5 to follow instructions on turning a tweet into a Haiku.

On the first point, we should expect text which occurs repeatedly in the dataset to be compressed while preserving meaning. Text regarding the data-cleaning spec is no exception here.

Ajeya has discussed situational awareness here.

You are correct regarding the training/deployment distinction.

Agreed on the first part. I'm not entirely clear on what you're referring to in the second paragraph though. What calculation has to be spread out over multiple tokens? The matching to previously encountered K-1 sequences? I'd suspect that, in some sense, most LLM calculations have to work across multiple tokens, so not clear on what this has to do with emergence either.

Charlie Steiner (4mo):
The calculations I mean are something like:

  • Activate a representation of the text of the spec of its data-cleaning process. This should probably be the same representation it could use to quote from the spec - if it needed some special pre-digested representation, that would make it a lot worse at generalization to untested parts of the spec.
  • Predicting its consequences. For this to be simple, it should be using some kind of general consequence-predictor rather than learning how to predict consequences from scratch. Specifically it needs some kind of sub-process that accepts inputs and has outputs ready at a broad range of layers in the network, and can predict the consequences of many things in parallel to be generally useful. If such a sub-process doesn't exist in some LLM, that's bad news for that LLM's ability to calculate consequences at runtime.
  • Represent the current situation in a way that's checkable against the patterns predicted by the spec.
  • And still have time left over to do different processing depending on the results of the check - the LLM has to have a general abstraction for what kind of processing it should be doing from here, so that it can generalize to untested implications of the spec.

So that's all pretty ambitious. I think spreading the work out over multiple tokens requires those intermediate tokens to have good in-text reasons to contain intermediate results (as in "think step by step").

the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).

Why is this supposed to be true? Intuitively, this seems to clash with the author's view that anthropic reasoning is likely to be problematic. From another angle, I expect performance gain from situational awareness to increase as dataset cleaning/curation increases. Dataset cleaning has increas…

To be clear, I think situational awareness is relevant in pre-training, just less so than in many other cases (e.g. basically any RL setup, including RLHF) where the model is acting directly in the world (and when exactly in the model's development it gets an understanding of the training process matters a lot for deceptive alignment). From footnote 6 above:

This is an empirical question, so I may be missing some key points. Anyway here are a few:

  • My above points on Ajeya anchors and semi-informative priors
    • Or, put another way, why reject Daniel’s post?
  • Can deception precede economically transformative AI?
    • Possibly offer a prize on formalizing and/or distilling the argument for deception (Also its constituents i.e. gradient hacking, situational awareness, non-myopia)
  • How should we model software progress? In particular, what is the right function for modeling short-term return on investment to algorithmic progress?
    • My guess is th…

That post seems to mainly address high P(doom) arguments and reject them. I agree with some of those arguments and the rejection of high P(doom). I don't see as direct a relevance to my previous comment. As for the broader point of self-selection, I think this is important, but cuts both ways: funders are selected to be competent generalists (and are biased towards economic arguments); as such, they are predisposed to under-update on inside views. As an extreme case of this, consider e.g. Bryan Caplan.

Here are comments on two of Nuno's arguments whi... (read more)

Yeah, I agree that the disagreement is probably more important to resolve, and I haven't much addressed that.

My deeply concerning impression is that OpenPhil (and the average funder) has timelines 2-3x longer than the median safety researcher. Daniel has his AGI training requirements set to 3e29, and I believe the 15th-85th percentiles among safety researchers would span 1e31 +/- 2 OOMs. On that view, Tom's default values are off in the tails.

My suspicion is that funders write off this discrepancy, if noticed, as inside-view bias, i.e. thinking safety researchers self-select for scaling optimism. My (admittedly very crude) mental model of an OpenPhil f…

What concrete cruxes would you most like to see investigated?
Basically, a rational reason to have longer timelines is the fact that there's a non-trivial chance that safety researchers are wrong due to selection effects, community epistemic problems, and overestimating the impact of AGI.

In my opinion, the applications of prediction markets are much more general than these. I have a bunch of AI safety inspired markets up on Manifold and Metaculus. I'd say the main purpose of these markets is to direct future research and study. I'd phrase this use of markets as "a sub-field prioritization tool". The hope is that markets would help me integrate information such as (1) a methodology's scalability, e.g. in terms of data, compute, generalizability, (2) research directions' rate of progress, and (3) diffusion of a given research direction through the re…

Ok seems like our understandings of ELK are quite different. I have in mind transformers, but not sure that it much matters. I'm making a question.

Since tailcalled seems reluctant, I'll make one. More can't hurt though.

Can you please make one for whether you think ELK will have been solved (/substantial progress has been made) by 2026? I could do it, but would be nice to have as many as possible centrally visible when browsing your profile.

EDIT: I have created a question here

I might create an account on Manifold Markets to make this question.
Hmm... My suspicion is that if ELK gets solved in practice, it will be by restricting the class of neural networks under consideration. Yet the ELK challenge adds a requirement that it has to work on just about any neural network.

When are intuitions reliable? Compression, population ethics, etc.

Intuitions are the results of the brain doing compression. Generally the source data which was compressed is no longer associated with the intuition. Hence from an introspective perspective, intuitions all appear equally valid.

Taking a third-person perspective, we can ask what data was likely compressed to form a given intuition. A pro sports player's intuition for that sport has a clearly reliable basis. Our moral intuitions on population ethics are formed via our experiences in everyday si…

Seems to me safety timeline estimation should be grounded by a cross-disciplinary research timeline prior. Such a prior would be determined by identifying a class of research proposals similar to AI alignment in terms of how applied/conceptual/mathematical/funded/etc. they are, and then collecting data on how long they took.

I'm not familiar with meta-science work, but this would probably involve doing something like finding an NSF (or DARPA) grant category where grants were made public historically and then tracking down what became of those lines of…

Esben Kran (8mo):
Very good suggestions. Funnily enough, our next report post will be very much along these lines (among other things). We're also looking at inception-to-solution time for mathematics problems and for correlates of progress in other fields, e.g. solar cell efficiency <> amount of papers in photovoltaics research. We'd also love to curate this data as you mention and make sure that everyone has easy access to priors that can help in deciding AI safety questions about research agenda, grant applications, and career path trajectory. Cross-posted on EAF

Can anyone point me to a write-up steelmanning the OpenAI safety strategy; or, alternatively, offer your take on it? To my knowledge, there's no official post on this, but has anyone written an informal one?

Essentially what I'm looking for is something like an expanded/OpenAI version of AXRP ep 16 with Geoffrey Irving in which he lays out the case for DM's recent work on LM alignment. The closest thing I know of is AXRP ep 6 with Beth Barnes.

In terms of decision relevance, the update towards "Automate AI R&D → Explosive feedback loop of AI progress specifically" seems significant for research prioritization. Under such a scenario, getting the tools that automate AI R&D to be honest and transparent is more likely to be a pre-requisite for aligning TAI. Here's my speculation as to what the automated AI R&D scenario implies for prioritization:

Candidates for increased priority:

  1. ELK for code generation
  2. Interpretability for transformers ...

Candidates for decreased priority:

  1. Safety of real wo…

Research Agenda Base Rates and Forecasting

An uninvestigated crux of the AI doom debate seems to be pessimism regarding current AI research agendas. For instance, I feel rather positive about ELK's prospects, but in trying to put some numbers on this feeling, I realized I have no sense of base rates for research programs' success, nor their average time horizons. I can't seem to think of any relevant Metaculus questions either.

What could be some relevant reference classes for AI safety research programs' success odds? Seems most similar to disciplines…

Here's my stab at rephrasing this argument without reference to IB. Would appreciate corrections, and any pointers on where you think the IB formalism adds to the pre-theoretic intuitions:

At some point imitation will progress to the point where models use information about the world to infer properties of the thing they're trying to imitate (humans) -- e.g. human brains were selected under some energy efficiency pressure, and so have certain properties. The relationship between "things humans are observed to say/respond to" and "how the world works" is extr…

(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:

I. (Safety) Relevant examples of Z

To build intuition on whether unobserved location tags lead to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know whether we should think of there being many independent, local Z_i, or a dataset-wide Z? The former case seems much less concerning, as it seems less likely to lead to the adoption of a problematically mistaken ontology.

Here are a couple examples I…

I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems with those ideas, proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research.

This seems like a crux for the Paul-Eliezer disagreement wh…

Tor Økland Barstad (1y):
How useful AI-systems can be at this sort of thing after becoming catastrophically dangerous is also worth discussing more than is done at present. At least I think so. Between Eliezer and me I think maybe that's the biggest crux (my intuitions about FOOM are Eliezer-like I think, although AFAIK I'm more unsure/agnostic regarding that than he is). It's obviously a more favorable situation if the AGI-system is aligned before it could destroy the world. But even if we think we succeeded with alignment prior to superintelligence (and possible FOOM), we should look for ways it can help with alignment afterwards, so as to provide additional security/alignment-assurance. As Paul points out, verification will often be a lot easier than generation, and I think techniques that leverage this (also with superintelligent systems that may not be aligned) are underdiscussed. And how easy/hard it would be for an AGI-system to trick us (into thinking it's being helpful when it really wasn't) would depend a lot on how we went about things. There are various potential ways of getting help for alignment while keeping "channels of causality" quite limited and verifying the work/output of the AI-system in powerful ways. I've started on a series about this:

Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time.

Another way to frame this point is that humans are always doing multi-modal processing in the background, even for tasks which require only considering one sensory modality. Doing this sort of multi-modal cross checking by default offers better edge case performance at the cost of lower efficiency in the average case. 

Estimating how much safety research contributes to capabilities via citation counts

An analysis I'd like to see is:

  1. Aggregate all papers linked on Alignment Newsletter Database (public) - Google Sheets 
  2. For each paper, count what percentage of citing papers are also in ANDB vs not in ANDB (or use some other way of classifying safety vs not safety papers)
  3. Analyze differences by subject area / author affiliation

My hypothesis: RLHF, and OpenAI work in general, has high capabilities impact. For other domains e.g. interpretability, preventing bad behavior, age…
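The counting step in the sketch above could be implemented roughly as follows (hypothetical helper, mine; it assumes paper IDs from the citation graph and the ANDB can be put in a common format):

```python
def safety_citation_fraction(citing_ids, safety_db_ids):
    """Fraction of the papers citing a given paper that also appear in the
    safety database (e.g. the Alignment Newsletter Database)."""
    safety_db_ids = set(safety_db_ids)
    if not citing_ids:
        return 0.0
    n_safety = sum(1 for pid in citing_ids if pid in safety_db_ids)
    return n_safety / len(citing_ids)
```

A low fraction for a safety paper would then suggest its citations flow mostly into non-safety (capabilities) work, which is the quantity of interest here.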

Recently I've had success introducing people to AI risk by telling them about ELK, and specifically how human simulation is likely to be favored. Providing this or a similarly general argument, e.g. that power-seeking is convergent, seems both more intuitive (humans can also do simulation and power-seeking) and faithful to actual AI risk drivers to me than the video speed angle? ELK and power-seeking are also useful complements to specific AI risk scenarios.

The video speed framing seems to make the undesirable suggestion that AI ontology will be human-but-…

before they are developed radically faster by AI they will be developed slightly faster.

I see a couple reasons why this wouldn't be true: 

First, consider LLM progress: overall perplexity improves relatively smoothly, while particular capabilities emerge abruptly. As such, the ability to construct a coherent Arxiv paper interpolating between two papers from different disciplines seems likely to emerge abruptly. I.e. currently asking an LLM to do this would generate a paper with zero useful ideas, and we have no reason to expect that the first GPT-N to be able…

I expect AI to look qualitatively like (i) "stack more layers,"... The improvements AI systems make to AI systems are more like normal AI R&D ... There may be important innovations about how to apply very large models, but these innovations will have quantitatively modest effects (e.g. reducing the compute required for an impressive demonstration by 2x or maybe 10x rather than 100x) 

Your view seems to implicitly assume that an AI with an understanding of NN research at the level necessary to contribute SotA results will not be able to leverage its…

I'm classifying "optimized, partially binarized, spiking neural networks" as architecture changes. I expect those to be gradually developed by humans and to represent modest and hard-won performance improvements. I expect them to eventually be developed faster by AI, but that before they are developed radically faster by AI they will be developed slightly faster. I don't think interdisciplinarity is a silver bullet for making faster progress on deep learning. I don't think I understand the Metaculus questions precisely enough to predict on them; it seems like the action is in implicit quantitative distinctions:

  • In "Years between GWP Growth > 25% and AGI," the majority of the AGI definition is carried by a 2-hour adversarial Turing test. But the difficulty of this test depends enormously on the judges and on the comparison human. If you use the strongest possible definition of Turing test, then I'm expecting the answer to be negative (though the mean is still large and positive, because it is extraordinarily hard for it to go very negative). If you take the kind of Turing test I'd expect someone to use in an impressive demo, I expect it to be >5 years, and this is mostly just a referendum on timelines.
  • For "AI capable of developing AI software," it seems like all the action is in quantitative details of how good (/novel/etc.) the code is; I don't think that a literal meeting of the task definition would have a large impact on the world.
  • For "transformers to accelerate DL progress," I guess the standard is clear, but it seems like a weird operationalization: would the question already resolve positively if we were using LSTMs instead of transformers, because of papers like this one? If not, then it seems like the action comes down to unstated quantitative claims about how good the architectures are. I think that transformers will work better than RNNs for these applications…

I see three distinct reasons for the (non-)existence of terminal goals:

I. Disjoint proxy objectives

A scenario in which there seems to be reason to expect no global, single, terminal goal:

  1. Outer loop pressure converges on multiple proxy objectives specialized to different sub-environments in a sufficiently diverse environment.
  2. These proxy objectives will be activated in disjoint subsets of the environment.
  3. Activation of proxy objectives is hard-coded by the outer loop. Information about when to activate a given proxy objective is under-determined at the inner…

The OP doesn’t explicitly make this jump, but it’s dangerous to conflate the claims “specialized models seem most likely” and “short-term motivated safety research should be evaluated in terms of these specialized models”.

I agree with the former statement, but at the same time, the highest x-risk / highest EV short-term safety opportunity is probably different. For instance, a less likely but higher impact scenario: a future code generation LM either directly or indirectly* creates an unaligned, far improved architecture. Researchers at the relevant org do…

Great point. I agree and should have said something like that in the post. To expand on this a bit more, studying these specialized models will be valuable for improving their robustness and performance. It is possible that this research will be useful for alignment in general, but it's not the most promising approach. That being said, I want to see alignment researchers working on diverse approaches.

The relationship between valence and attention is not clear to me, and I don't know of a literature which tackles this (though imperativist analyses of valence are related). Here are some scattered thoughts and questions which make me think there's something important here to be clarified:

  • There's a difference between a conscious stimulus having high saliency/intensity and being intrinsically attention focusing. A bright light suddenly strobing in front of you is high saliency, but you can imagine choosing to attend or not to attend to it. It seems to me…

Valence is of course a result of evolution. If we can identify precisely what evolutionary pressures incentivize valence, we can take an outside (non-anthropomorphizing, non-xenomorphizing) view: applying Laplace's rule gives us a 2/3 chance that AI developed with similar incentives will also experience valence?
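The 2/3 figure is Laplace's rule of succession applied to one observed 'trial' (evolution on Earth) with one 'success' (valence arose); a one-line sketch of the arithmetic:

```python
from fractions import Fraction

def laplace_rule(successes, trials):
    # Laplace's rule of succession: P(next trial succeeds) = (s + 1) / (n + 2)
    return Fraction(successes + 1, trials + 2)

# one observed trial, one success: (1 + 1) / (1 + 2) = 2/3
assert laplace_rule(1, 1) == Fraction(2, 3)
```

Whether "AI developed under similar incentives" really counts as a draw from the same process is, of course, doing all the work in that estimate.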

How exactly does reward relate to valenced states in humans? In general, what gives rise to pleasure and pain, in addition to (or instead of) the processing of reward signals?

These problems seem important and tractable even if working out the full computational theory of valence might not be. We can distinguish three questions:

  1. What is the high-level functional role of valence? (coarse-grained functionalism)
  2. What evolutionary pressures incentivized valenced experience?
  3. What computational processes constitute valence? (fine-grained functionalism)

Answe…

Very interesting! Thanks for your reply, and I like your distinction between questions. Can you elaborate on this? What do attention concentration vs. diffusion mean? Pain seems to draw attention to itself (and to motivate action to alleviate it). On my normal understanding of "concentration", pain involves concentration. But I think I'm just unfamiliar with how you / 'the literature' use these terms.

A quadratic funding mechanism (similar to Gitcoin) could make sense for putting up distillation bounties. Quadratic funding (QF) lets a grant-maker put up a pool of matching funds while individual researchers specify how much each individual bounty would be valuable to her/him; then the matching is done via the QF strategy to optimize for aggregate researcher utility. Speaking for myself, I would contribute to a community fund for further distillations, and I would also be more likely to distill.
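For concreteness, here is a minimal sketch of the capped QF matching rule in the Gitcoin style (function name and interface are my own): a project's ideal funding is the square of the sum of the square roots of its contributions, so many small contributions attract more matching than one large contribution of the same total.

```python
import math

def qf_matches(contributions, matching_pool):
    """contributions: dict mapping project -> list of individual donations.
    Returns each project's share of the matching pool under capped QF."""
    # QF-optimal funding for a project: (sum of square roots of donations)^2
    raw = {p: sum(math.sqrt(c) for c in cs) ** 2 for p, cs in contributions.items()}
    # match = QF-optimal amount minus what individuals already gave
    matches = {p: raw[p] - sum(cs) for p, cs in contributions.items()}
    total = sum(matches.values())
    # scale matches down proportionally to fit the available pool
    scale = min(1.0, matching_pool / total) if total > 0 else 0.0
    return {p: m * scale for p, m in matches.items()}
```

E.g. four donations of 1 to a bounty yield an uncapped match of (4·√1)² − 4 = 12, while a single donation of 4 yields (√4)² − 4 = 0: the mechanism rewards breadth of researcher demand, which is exactly the aggregation property wanted here.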

I find the level of distillation done by Daniel Filan at AXRP…

One way to define such a bounty would be via Alignment Forum karma. You could publish a list of papers to be distilled and specify how much you would pay for an alignment forum post with 10, 25, and 50 alignment forum karma on the topic.

That certainly sounds scary, but seems unlikely in my case. No tox screen, but also did not buy locally in Berkeley, and had previously used the pills without problem.

Sleeping 22 hours a day for 2-3 days pre-admission and fever. I think the presumption was those sorts of symptoms merit careful investigation. Don't remember if there were any particular test results that were remarkable. IIRC there weren't.

Seeing as the modafinil was not prescription, and I've never heard of similar symptoms from others, it's quite plausible my pills were just contaminated with some other substance. Still should probably update against taking modafinil without prescription, since this contamination risk is just as important as side-effect symptoms.

So you procured study drugs from an illicit source, took them, felt your body temp rise, stopped taking them, spent the next few days sleeping like crazy, and presented at the hospital? Did they do a tox screen (meth and similar stimulants?) I posted on another thread a while ago, according to dancesafe, counterfeit modafinil that's actually low dose methamphetamine was being marketed in Berkeley. I'd expect this to be common, because the following reasoning is a 'flash of inspiration' I'd expect a drug dealer to have... Nerds have money -> nerds want study drugs -> most study drugs are stimulants -> pill press is cheap -> meth is widely available -> dilute the meth doses you'd usually sell to tweakers, put in pill press to look like study drug, sell to nerds. Brilliant plan right?

I once had a multiple day hospitalization following use of modafinil (to prevent jetlag) during a flight -- checkups found no clear cause. This is obviously N=1, but still makes me wonder if there's some adverse interaction between modafinil and pressure changes. Would be interested if anyone has had similar experiences and/or knows of a relevant mechanism.

What were the symptoms that led you to go to the hospital, and what did they observe there that convinced them to have you stay after the initial exam?

Here's Chalmers defending his combinatorial state automata idea.

Thanks! Exactly what I was looking for :)

Here's one way of thinking about sleep which seems compatible with both the less-sleep-needed thesis and the lower-productivity-while-deprived observation: Some minimal amount of sleep provides a metabolic / cognitive role, and beyond this amount, additional hours of sleep were useful in the evolutionary context to save calories when the additional wakeful hours would not provide pay off.

If true, we'd expect there to be a more-or-less fixed function from sleep quantity to sleepiness within the very low sleep range, but in the mid-sleep (5-8 hr?) range this fu…

Yes, I agree certainly at 2025 training run prices, saving 2-5x on a compute run will be done whenever possible. For this reason, I'd like to see more predictions on my Metaculus question!

I agree that the scaling laws for transfer paper already strongly suggested that pre-training would eventually not provide much in terms of performance gain. I remember doing a back-of-the-envelope for whether 2025 would still use pre-training (and finding it wouldn't improve performance), but I certainly didn't expect us to reach this point in early 2022. I also had some small, but significant uncertainty regarding how well the scaling laws result would hold up when switching dataset+model+modelsize, and so the AlphaCode data point is useful in that regar…

An OOM is nothing to sneeze at, especially when you can get it for free by training an off-the-shelf pretrained model (DM already trained a Gopher, it doesn't cost any more to reuse!) exactly as you would otherwise, no compromises or deadends like MoEs. Note that AlphaCode didn't have the compute budget to do its approach optimally.

It's worth noting that Table 7 shows Github pre-training outperforming MassiveText (natural language corpus) pre-training. The AlphaCode dataset is 715GB compared to the 10TB of MassiveText (which includes 3TB of Github). I have not read the full details of both cleaning processes, but I assume that the cleaning / de-duplication process is more thorough in the case of the AlphaCode Github only dataset. EDIT: see also Algon's comment on this below.

I know of a few EAs who thought that natural language pre-training will continue to provide relevant performanc…

I know of a few EAs who thought that natural language pre-training will continue to provide relevant performance increases for coding as training scales up over the next few years, and I see this as strong evidence against that claim.

I think that was largely settled by the earlier work on transfer scaling laws and Bayesian hierarchical interpretations: pretraining provides an informative prior which increases sample-efficiency in a related task, providing in essence a fixed n sample gain. But enough data washes out the prior, whether informative or unif…

They chuck out larger files from Github (1MB+), or files with lines longer than 1000 characters, to exclude automatically generated code. I'm guessing the former is because programs that are too long just aren't useful when your model's context window is tiny in comparison. They did also get rid of duplicates. Plus, they needed to avoid code published after the questions in the dataset were made, to avoid leaking the answer. As to natural language training, I suppose I'd agree that it is some evidence against the claim. But I'd say it would be strong evidence if they also trained AlphaCode on MassiveText and found little to no performance increase. I wouldn't be surprised if it didn't do much though. Edit: Oh, I just saw Table 7. Yeah, that's pretty strong evidence against the claim that natural language corpuses are anywhere near as useful as code corpuses for this kind of stuff. But I am a little surprised that it added as much as it did.

Yes, I agree that in the simplest case, SC2 with default starting resources, you just build one or two units and you're done. However, I don't see why this case should be understood as generically explaining the negative alpha weights setting. Seems to me more like a case of an excessively simple game?

Consider the set of games starting with various quantities of resources and negative alpha weights. As starting resources increase, you will be incentivised to go attack your opponent to interfere with their resource depletion. Indeed, if the reward is based…

I agree that in certain conceivable games which are not baseline SC2, there will be different power-seeking incentives for negative alpha weights. My commentary wasn't intended as a generic takeaway about negative feature weights in particular. But in the game which actually is SC2, where you don't start with a huge number of resources, negative alpha weights don't incentivize power-seeking. You do need to think about the actual game being considered, before you can conclude that negative alpha weights imply such-and-such a behavior. I think that either γ<<1 or considering suboptimal power-seeking resolves the situation. The reason that building infrastructure intuitively seems like power-seeking is that we are not optimal logically omniscient agents; all possible future trajectories do not lay out immediately before our minds. But the suboptimal power-seeking metric (Appendix C in Optimal Policies Tend To Seek Power) does match intuition here AFAICT, where cleverly building infrastructure has the effect of navigating the agent to situations with more cognitively exploitable opportunities.