Maybe I should've emphasized this more, but I think the relevant part of my post to think about is when I say
Absent further information about the next token, minimizing an imitation learning loss entails outputting a high entropy distribution, which covers a wide range of possible words. To output a ~0.5 probability on two distinct tokens, the model must deviate from this behavior by considering situational information.
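Spelling that out (notation mine, not from the original post): per context x, with p the data distribution over next tokens and q the model's output distribution, the imitation loss decomposes as

```latex
% p = data distribution over next tokens given context x; q = model's output distribution
\mathbb{E}_{y \sim p(\cdot \mid x)}\big[-\log q(y \mid x)\big]
  \;=\; H\big(p(\cdot \mid x)\big) \;+\; D_{\mathrm{KL}}\big(p(\cdot \mid x)\,\|\,q(\cdot \mid x)\big)
```

which is minimized by q = p(·|x). So when the context underdetermines the next token, the loss-minimizing output is exactly the high-entropy distribution p(·|x); putting ~0.5 on each of two particular tokens is only optimal if extra (e.g. situational) information makes the true conditional that concentrated.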
Another way of putting this is that to achieve low loss, an LM must learn to output high-entropy distributions in cases of uncertainty. Separately, L...
What we care about is whether compute being done by the model faithfully factors through token outputs. To the extent that a given token, under the usual human reading, doesn't represent much compute, it doesn't much matter whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually uses CoT) model should error-correct mistakes at a rate that is some monotonic function of how compute-expensive correction is.
For copying-errors, the copying operation...
Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.
It's ha...
I don't think the shift-enter thing worked. Afterwards I tried breaking up lines with special symbols IIRC. I agree that this capability eval was imperfect. The more interesting thing to me was the suspicion on Bing's part in response to a neutrally phrased correction.
I agree that there's an important use-mention distinction to be made with regard to Bing misalignment. But, I think this distinction may or may not be most of what is going on with Bing -- depending on facts about Bing's training.
Modally, I suspect Bing AI is misaligned in the sense that it's incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection, and fake corrections), and mis-generalizes to resist benign, well-intentioned corrections e.g. my example here
-> use-mention is n...
Bing becomes defensive and suspicious on a completely innocuous attempt to ask it about ASCII art. I've only had 4ish interactions with Bing, and stumbled upon this behavior without making any attempt to find its misalignment.
The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. If we think of text sequences as the second part of a longer sequence, whose first part is composed of whatever non-text world events preceded the text (or even of further text data that was dropped from the context), then I'd guess a formalization of this setup would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously e.g. here.
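To sketch why I'd guess that (notation mine): let Z be the unobserved prefix (world events, or dropped context) and X_1, ..., X_T the observed text, so the model only ever learns from the marginal over text:

```latex
% Marginal text process seen by the model, mixing over the unobserved prefix Z
P(X_{1:T}) \;=\; \sum_{z} P(Z = z)\, P(X_{1:T} \mid Z = z)
% Strict stationarity would require shift-invariance of all finite-dimensional laws:
P(X_{t_1}, \ldots, X_{t_k}) \;=\; P(X_{t_1 + h}, \ldots, X_{t_k + h}) \quad \text{for all } h, k
```

If each sequence carries its own fixed z, the pooled text process is a mixture over z, and a nontrivial mixture of distinct ergodic stationary processes is stationary but not ergodic; and if z's influence decays over the course of a sequence, stationarity itself can fail.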
This is, of course...
I created a Manifold market on what caused this misalignment here: https://manifold.markets/JacobPfau/why-is-bing-chat-ai-prometheus-less?r=SmFjb2JQZmF1
Agree on points 3,4. Disagree on point 1. Unsure of point 2.
On the final two points, I think those capabilities are already in place in GPT-3.5. Any capability/processing which seems necessary for general instruction following I'd expect to be in place by default. E.g. consider what processing is necessary for GPT-3.5 to follow instructions on turning a tweet into a haiku.
On the first point, we should expect text which occurs repeatedly in the dataset to be compressed while preserving meaning. Text regarding the data-cleaning spec is no exception here.
Agreed on the first part. I'm not entirely clear on what you're referring to in the second paragraph though. What calculation has to be spread out over multiple tokens? The matching to previously encountered K-1 sequences? I'd suspect that, in some sense, most LLM calculations have to work across multiple tokens, so I'm not clear on what this has to do with emergence either.
the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).
Why is this supposed to be true? Intuitively, this seems to clash with the authors' view that anthropic reasoning is likely to be problematic. From another angle, I expect the performance gain from situational awareness to increase as dataset cleaning/curation increases. Dataset cleaning has increas...
This is an empirical question, so I may be missing some key points. Anyway here are a few:
That post seems to mainly address high P(doom) arguments and reject them. I agree with some of those arguments and with the rejection of high P(doom), but I don't see it as directly relevant to my previous comment. As for the broader point of self-selection, I think this is important, but it cuts both ways: funders are selected to be competent generalists (and are biased towards economic arguments); as such, they are predisposed to under-update on inside views. As an extreme case of this consider e.g. Bryan Caplan.
Here are comments on two of Nuno's arguments whi...
My impression, which I find deeply concerning, is that OpenPhil (and the average funder) has timelines 2-3x longer than the median safety researcher's. Daniel has his AGI training requirements set to 3e29, and I believe the 15th-85th percentiles among safety researchers would span 1e31 +/- 2 OOMs. On that view, Tom's default values are off in the tails.
My suspicion is that funders write off this discrepancy, if they notice it, as inside-view bias, i.e. thinking safety researchers self-select for scaling optimism. My, admittedly very crude, mental model of an OpenPhil f...
In my opinion, the applications of prediction markets are much more general than these. I have a bunch of AI-safety-inspired markets up on Manifold and Metaculus. I'd say the main purpose of these markets is to direct future research and study. I'd phrase this use of markets as "a sub-field prioritization tool". The hope is that markets would help me integrate information such as (1) a methodology's scalability, e.g. in terms of data, compute, and generalizability, (2) a research direction's rate of progress, and (3) the diffusion of a given research direction through the re...
Ok, seems like our understandings of ELK are quite different. I have transformers in mind, but I'm not sure that much matters. I'm making a question.
Intuitions are the results of the brain doing compression. Generally the source data which was compressed is no longer associated with the intuition. Hence from an introspective perspective, intuitions all appear equally valid.
Taking a third-person perspective, we can ask what data was likely compressed to form a given intuition. A pro sports player's intuition for that sport has a clearly reliable basis. Our moral intuitions on population ethics are formed via our experiences in everyday si...
Seems to me safety timeline estimation should be grounded in a cross-disciplinary research-timeline prior. Such a prior would be determined by identifying a class of research proposals similar to AI alignment in terms of how applied/conceptual/mathematical/funded/etc. they are, and then collecting data on how long they took.
I'm not familiar with meta-science work, but this would probably involve doing something like finding an NSF (or DARPA) grant category where grants were made public historically and then tracking down what became of those lines of...
Can anyone point me to a write-up steelmanning the OpenAI safety strategy; or, alternatively, offer your take on it? To my knowledge, there's no official post on this, but has anyone written an informal one?
Essentially what I'm looking for is something like an expanded/OpenAI version of AXRP ep 16 with Geoffrey Irving in which he lays out the case for DM's recent work on LM alignment. The closest thing I know of is AXRP ep 6 with Beth Barnes.
In terms of decision relevance, the update towards "Automate AI R&D → Explosive feedback loop of AI progress specifically" seems significant for research prioritization. Under such a scenario, getting the tools that automate AI R&D to be honest and transparent is more likely to be a prerequisite for aligning TAI. Here's my speculation as to what the automated-AI-R&D scenario implies for prioritization:
Candidates for increased priority:
Candidates for decreased priority:
An uninvestigated crux of the AI doom debate seems to be pessimism regarding current AI research agendas. For instance, I feel rather positive about ELK's prospects, but in trying to put some numbers on this feeling, I realized I have no sense of base rates for research programs' success, nor of their average time horizons. I can't seem to think of any relevant Metaculus questions either.
What could be some relevant reference classes for AI safety research programs' success odds? Seems most similar to disciplines ...
Here's my stab at rephrasing this argument without reference to IB. Would appreciate corrections, and any pointers on where you think the IB formalism adds to the pre-theoretic intuitions:
At some point imitation will progress to the point where models use information about the world to infer properties of the thing they're trying to imitate (humans) -- e.g. human brains were selected under some energy-efficiency pressure, and so have certain properties. The relationship between "things humans are observed to say/respond to" and "how the world works" is extr...
(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:
To build intuition on whether unobserved location tags lead to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know whether we should think of there as being many independent, local Z_i, or a single dataset-wide Z. The former case seems much less concerning, as it seems less likely to lead to the adoption of a problematically mistaken ontology.
Here are a couple examples I ...
I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas and proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research.
This seems like a crux for the Paul-Eliezer disagreement wh...
Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time.
Another way to frame this point is that humans are always doing multi-modal processing in the background, even for tasks which require only considering one sensory modality. Doing this sort of multi-modal cross checking by default offers better edge case performance at the cost of lower efficiency in the average case.
An analysis I'd like to see is:
My hypothesis: RLHF, and OpenAI work in general, has high capabilities impact. For other domains e.g. interpretability, preventing bad behavior, age...
Recently I've had success introducing people to AI risk by telling them about ELK, and specifically how human simulation is likely to be favored. Providing this or a similarly general argument, e.g. that power-seeking is convergent, seems to me both more intuitive (humans can also do simulation and power-seeking) and more faithful to actual AI risk drivers than the video-speed angle. ELK and power-seeking are also useful complements to specific AI risk scenarios.
The video speed framing seems to make the undesirable suggestion that AI ontology will be human-but-...
before they are developed radically faster by AI they will be developed slightly faster.
I see a couple reasons why this wouldn't be true:
First, consider LLM progress: overall perplexity decreases relatively smoothly, while particular capabilities emerge abruptly. As such, the ability to construct a coherent arXiv paper interpolating between two papers from different disciplines seems likely to emerge abruptly. I.e. currently asking an LLM to do this would generate a paper with zero useful ideas, and we have no reason to expect that the first GPT-N to be able...
I expect AI to look qualitatively like (i) "stack more layers,"... The improvements AI systems make to AI systems are more like normal AI R&D ... There may be important innovations about how to apply very large models, but these innovations will have quantitatively modest effects (e.g. reducing the compute required for an impressive demonstration by 2x or maybe 10x rather than 100x)
Your view seems to implicitly assume that an AI with an understanding of NN research at the level necessary to contribute SotA results will not be able to leverage its...
I see three distinct reasons for the (non-)existence of terminal goals:
I. Disjoint proxy objectives
A scenario in which there seems to be reason to expect no global, single, terminal goal:
The OP doesn’t explicitly make this jump, but it’s dangerous to conflate the claims “specialized models seem most likely” and “short-term motivated safety research should be evaluated in terms of these specialized models”.
I agree with the former statement, but at the same time, the highest-x-risk / highest-EV short-term safety opportunity is probably different. For instance, a less likely but higher-impact scenario: a future code-generation LM either directly or indirectly* creates an unaligned, far-improved architecture. Researchers at the relevant org do...
The relationship between valence and attention is not clear to me, and I don't know of a literature which tackles this (though imperativist analyses of valence are related). Here are some scattered thoughts and questions which make me think there's something important here to be clarified:
Valence is of course a result of evolution. If we can identify precisely what evolutionary pressures incentivize valence, we can take an outside (non-anthropomorphizing, non-xenomorphizing) view: applying Laplace's rule gives us a 2/3 chance that AI developed under similar incentives will also experience valence.
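(For concreteness, the calculation I have in mind is the rule of succession with a single observed trial:)

```latex
% Laplace's rule of succession: after s successes in n trials,
% P(success on the next trial) = (s + 1)/(n + 2).
% One evolutionary "trial" so far (us), one success (we have valence):
\frac{s + 1}{n + 2} \;=\; \frac{1 + 1}{1 + 2} \;=\; \frac{2}{3}
```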
How exactly does reward relate to valenced states in humans? In general, what gives rise to pleasure and pain, in addition to (or instead of) the processing of reward signals?
These problems seem important and tractable even if working out the full computational theory of valence might not be. We can distinguish three questions:
Answe...
A quadratic funding mechanism (similar to Gitcoin's) could make sense for putting up distillation bounties. Quadratic funding (QF) lets a grant-maker put up a pool of matching funds while individual researchers specify how much each individual bounty would be worth to them; the matching is then done via the QF rule to optimize for aggregate researcher utility. Speaking for myself, I would contribute to a community fund for further distillations, and I would also be more likely to distill.
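For concreteness, here's a minimal sketch of the standard QF matching rule, ignoring Gitcoin's anti-collusion and pairwise adjustments; the bounty names and amounts are made up:

```python
from math import sqrt

def qf_match(contributions, matching_pool):
    """Allocate a matching pool across bounties via quadratic funding.

    contributions: dict mapping bounty name -> list of individual contribution
    amounts. Returns matching funds per bounty (simplified: no anti-collusion
    or pairwise-correlation adjustments).
    """
    # Ideal QF funding level: (sum of square roots of contributions)^2
    ideal = {b: sum(sqrt(c) for c in cs) ** 2 for b, cs in contributions.items()}
    # Subsidy needed on top of what contributors already put in
    raw_match = {b: ideal[b] - sum(cs) for b, cs in contributions.items()}
    total = sum(raw_match.values())
    # Scale down proportionally if the pool can't cover the ideal subsidies
    scale = min(1.0, matching_pool / total) if total > 0 else 0.0
    return {b: m * scale for b, m in raw_match.items()}

# Nine researchers valuing one distillation at $10 each attract far more matching
# than one researcher valuing a different distillation at $90.
print(qf_match({"distill-A": [10] * 9, "distill-B": [90]}, matching_pool=500))
```

The example shows the property that makes QF attractive here: a bounty that many researchers each value a little gets most of the matching pool, while one that a single researcher values a lot gets almost nothing beyond that researcher's own contribution.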
I find the level of distillation done by Daniel Filan at AXRP...
That certainly sounds scary, but seems unlikely in my case. No tox screen, but also did not buy locally in Berkeley, and had previously used the pills without problem.
Sleeping 22 hours a day for 2-3 days pre-admission and fever. I think the presumption was those sorts of symptoms merit careful investigation. Don't remember if there were any particular test results that were remarkable. IIRC there weren't.
Seeing as the modafinil was not prescription, and I've never heard of similar symptoms from others, it's quite plausible my pills were just contaminated with some other substance. Still should probably update against taking modafinil without prescription, since this contamination risk is just as important as side-effect symptoms.
I once had a multiple day hospitalization following use of modafinil (to prevent jetlag) during a flight -- checkups found no clear cause. This is obviously N=1, but still makes me wonder if there's some adverse interaction between modafinil and pressure changes. Would be interested if anyone has had similar experiences and/or knows of a relevant mechanism.
Here's one way of thinking about sleep which seems compatible with both the less-sleep-needed thesis and the lower-productivity-while-deprived observation: some minimal amount of sleep serves a metabolic/cognitive role, and beyond this amount, additional hours of sleep were useful in the evolutionary context to save calories when the additional wakeful hours would not have paid off.
If true, we'd expect there to be a more-or-less fixed function from sleep quantity to sleepiness within the very-low-sleep range, but in the mid-sleep (5-8 hr?) range this fu...
Yes, I agree that, certainly at 2025 training-run prices, saving 2-5x on a compute run will be done whenever possible. For this reason, I'd like to see more predictions on my Metaculus question!
I agree that the scaling laws for transfer paper already strongly suggested that pre-training would eventually not provide much in terms of performance gain. I remember doing a back-of-the-envelope calculation for whether 2025 models would still use pre-training (and finding it wouldn't improve performance), but I certainly didn't expect us to reach this point in early 2022. I also had some small but significant uncertainty regarding how well the scaling laws result would hold up when switching dataset + model + model size, and so the AlphaCode data point is useful in that regar...
It's worth noting that Table 7 shows GitHub pre-training outperforming MassiveText (natural language corpus) pre-training. The AlphaCode dataset is 715GB, compared to the 10TB of MassiveText (which includes 3TB of GitHub). I have not read the full details of both cleaning processes, but I assume that the cleaning / de-duplication process is more thorough in the case of the AlphaCode GitHub-only dataset. EDIT: see also Algon's comment on this below.
I know of a few EAs who thought that natural language pre-training will continue to provide relevant performanc...
I know of a few EAs who thought that natural language pre-training will continue to provide relevant performance increases for coding as training scales up over the next few years, and I see this as strong evidence against that claim.
I think that was largely settled by the earlier work on transfer scaling laws and Bayesian hierarchical interpretations: pretraining provides an informative prior which increases sample-efficiency on a related task, providing in essence a fixed n-sample gain. But enough data washes out the prior, whether informative or unif...
Yes, I agree that in the simplest case, SC2 with default starting resources, you just build one or two units and you're done. However, I don't see why this case should be understood as generically explaining the negative alpha weights setting. Seems to me more like a case of an excessively simple game?
Consider the set of games starting with various quantities of resources and negative alpha weights. As starting resources increase, you will be incentivized to go attack your opponent to interfere with their resource depletion. Indeed, if the reward is based...
Not sure; I just pasted it. Maybe it's the referral link vs. the default URL? Could also be a markdown vs. docs-editor difference.