(Why "Top 3" instead of "literally the top priority?" Well, I do think a successful AGI lab also needs have top-quality researchers, and other forms of operational excellence beyond the ones this post focuses on. You only get one top priority, )
I think the situation is more dire than this post suggests, mostly because "You only get one top priority." If your top priority is anything other than this kind of organizational adequacy, it will take precedence too often; if your top priority is organizational adequacy, you probably can't get off the ground...
I just read an article that reminded me of this post. The relevant section starts with "Bender and Manning’s biggest disagreement is over how meaning is created". Bender's position seems to have some similarities with the thesis you present here, especially when viewed in contrast to what Manning claims is the currently more popular position that meaning can arise purely from distributional properties of language.
This got me wondering: if Bender is correct, then there is a fundamental limitation in how well (pure) language models can understand the world; ...
One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT-3); the former has no pretenses of doing only next-token prediction.
But they seem like they are only doing part of the "intelligence thing".
I want to be careful here; there is some evidence to suggest that they are doing (or at least capable of doing) a huge portion of the "intelligence thing", including planning, induction, and search, and even more if you include minor external capabilities like storage.
I don't know if anyone else has spoken about this, but since thinking about LLMs a little, I am starting to feel like there's something analogous to a small LLM (SLM?) embedded somewhere as a component in humans.
I think "sufficiently" is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?
I also don't think "something in the middle" is the right characterization; I think "something else" is more accurate. I think that the failure you're pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn't really present in either part.
I also think that "cyborg alignment" is in many ways a much more tractable proble...
OpenAI’s focus with doing these kinds of augmentations is very much “fixing bugs” with how GPT behaves: Keep GPT on task, prevent GPT from making obvious mistakes, and stop GPT from producing controversial or objectionable content. Notice that these are all things that GPT is very poorly suited for, but humans find quite easy (when they want to). OpenAI is forced to do these things, because as a public-facing company they have to avoid disastrous headlines like, for example: Racist AI writes manifesto denying the Holocaust.
As alignment researchers
I think that's an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with AI that is in some sense passive) is not an alignment risk (any more than any technology is). My worry is the "dual use technology" section.
simple utility functions are fundamentally incapable of capturing the complexity/subtleties/nuance of human preferences.
No objections there.
that inexploitable agents exist that do not have preferences representable by simple/unitary utility functions.
Going further, I think utility functions are anti-natural to generally capable optimisers in the real world.
I tentatively agree.
The existence of a utility function is a sometimes useful simplifying assumption, in a way similar to how logical omniscience is (or should we be doing all mat...
The base models of GPT-3 already have the ability to "follow instructions", it's just veiled behind the more general interface. [...] you can see how it contains this capability somewhere.
This is a good point that I forgot. My mental model of this is that since many training samples are Q&A, learning to complete them implies learning how to answer.
InstructGPT also has the value of not needing the wrapper of "Q:  A: ", but that's not really a qualitative difference.
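To make the "wrapper" concrete, here's a minimal sketch of the prompt-level difference (the function names are purely illustrative, not any real API):

```python
def base_model_prompt(question: str) -> str:
    # A purely self-supervised model only continues text, so we frame the
    # question as a document it is likely to continue with an answer.
    return f"Q: {question}\nA:"

def instruct_model_prompt(question: str) -> str:
    # An instruction-tuned model can be handed the bare question directly.
    return question

print(base_model_prompt("What is the capital of France?"))
```

Both routes elicit an answer; the difference is just whose job it is to set up the framing.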
I want to push back a little bit on the claim that this is not a qu...
Other people have given good answers to the main question, but I want to add just a little more context about self-modifying code.
A bunch of MIRI's early work explored the difficulties of the interaction of "rationality" (including utility functions induced by consistent preferences) with "self-modification" or "self-improvement"; a good example is this paper. They pointed out some major challenges that come up when an agent tries to reason about what future versions of itself will do; this is particularly important because one failure mode of AI alignment...
(I may promote this to a full question)
Do we actually know what's happening when you take an LLM trained on token prediction and fine-tune it via e.g. RLHF to get something like InstructGPT or ChatGPT? The more I think about the phenomenon, the more confused I feel.
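To be clear, I understand the mechanical recipe; the usual formulation (e.g. the InstructGPT-style objective, with a learned reward model $r_\phi$ and a KL penalty back toward the pretrained reference policy) is roughly

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\!\big[r_\phi(x,y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\big[\mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\big].$$

My confusion is about what optimizing this actually does to the learned distribution, not about the recipe itself.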
English doesn’t have great words for me to describe what I mean here, but it’s something like: your visualization machinery says that it sees no obstacle to success, such that you anticipate either success or getting a very concrete lesson.
One piece of advice/strategy I've received that's in this vein is "maximize return on failure". So prefer to fail in ways that teach you a lot, and to fail quickly, cheaply, and conclusively, and to produce positive externalities from failure. This is not so much a good search strategy as a good guiding principle and selection heuristic.
This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I've been thinking about where people seem to conflate utility-optimizing agents with policy-executing agents, but the two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍.
That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers--agents that do things like planning or search to maximize a utility function; but plannin...
My working model for unintended capabilities improvement focuses more on second- and third-order effects: people see promise of more capabilities and invest more money (e.g. found startups, launch AI-based products), which increases the need for competitive advantage, which pushes people to search for more capabilities (loosely defined). There is also the direct improvement to the inner research loop, but this is less immediate for most work.
Under my model, basically any technical work that someone could look at and think "Yes, this will help me (or my pro...
I don't think "definitions" are the crux of my discomfort. Suppose the model learns a cluster; the position, scale, and shape parameters of this cluster summary are not perfectly stable--that is, they vary somewhat with different training data. This is not a problem on its own, because it's still basically the same; however, the (fuzzy) boundary of the cluster is large (I have a vague intuition that the curse of dimensionality is relevant here, but nothing solid). This means that there are many cutting planes, induced by actions to be taken downstream of t...
This seems like a useful special case of "conditions-consequences" reasoning. I wonder whether
A good example of this is the Collatz conjecture. It has a stupendous amount of evidence in the finite data points regime, but no mathematician worth their salt would declare Collatz solved, since it needs to have an actual proof.
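For concreteness, that finite-regime evidence is essentially exhaustive checking along these lines (the bounds here are arbitrary, just for illustration):

```python
def collatz_reaches_one(n: int, max_steps: int = 10_000) -> bool:
    # Follow the Collatz map (n -> n/2 if n is even, 3n+1 if n is odd)
    # and report whether the orbit hits 1 within max_steps iterations.
    for _ in range(max_steps):
        if n == 1:
            return True
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return False

# Every n checked so far reaches 1, but no finite check settles the conjecture.
assert all(collatz_reaches_one(n) for n in range(1, 100_000))
```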
It's important to distinguish between the probability you'd get from a naive induction argument and a more credible estimate that takes into account the reference class of similar mathematical statements that hold up to large but finite limits but may or may not hold for all naturals.
Similarly, P=NP is another problem with vast evidence,
I feel like this is a difference between "almost surely" and "surely", both of which are typically expressed as "probability 1", but which are qualitatively different. I'm wondering whether infinitesimals would actually work to represent "almost surely" (as suggested in this post).
Also a nitpick and a bit of a tangent, but in some cases, a mathematician will accept any probability > 0 as proof; probabilistic proof is a common tool for non-constructive proof of existence, especially in combinatorics (although the way I've seen it, it's u...
Not missing the joke, just engaging with a different facet of the post.
should I do something like rewriting the main post to mention the excellent answers I had, or can I click on some « accept this answer » button somewhere?
I don't think there's an "accept answer" button, and I don't think you're expected to update your question. I personally would probably edit it to add one sentence summarizing your takeaways.
I don't think it would work to slow down AI capabilities progress. The reason is that AI capabilities translate into money in a way that's much more direct than "science" writ large--they're a lot closer to engineering.
Put differently, even if it could have worked (and before GPT-2 and the surrounding hype, I might have believed it could), it's too late now.
western philosophy has a powerful anti-skepticism strain, to the point where "you can know something" is almost axiomatic
I'm pretty pessimistic about the strain of philosophy as you've described it. I have yet to run into a sense of "know" that is binary (i.e. not "believed with probability") that I would accept as an accurate description of the phenomenon of "knowledge" in the real world rather than as an occasionally useful approximation. Between the preface paradox (or its minor modification, the lottery paradox) and Fitch's paradox of knowability, I do...
My response would be that this unfairly (and even absurdly) maligns "theory"!
I agree. However, the way I've had the two generals problem framed to me, it's not a solution unless it guarantees successful coordination. Like, if I claim to solve the halting problem because in practice I can tell if a program halts, most of the time at least, I'm misunderstanding the problem statement. I think that conflating "approximately solves the 2GP" with "solves the 2GP" is roughly as malign as my claim that the approximate solution is not the realm of theory.
You're saying that transformers are key to alignment research?
I would imagine that latent space exploration and explanation is a useful part of interpretability, and developing techniques that work for both language and images improves the chance that the techniques will generalize to new neural architectures.
getting both differential privacy and capabilities pushes non-differentially-private capabilities more, usually, I think, or something
I don't think it does in general, and every case I can think of right now did not, but I agree that it is a worthwhile thing to worry about.
tools [for finding DP results] I'd recommend include
I'd add clicking through citations and references on arXiv, and looking at the Litmaps explorer there.
As mentioned in the other reply, DP gives up performance (though with enough data you can overcome that, and in many cases, you'd need only a little less data for reliable answers anyway).
Another point is that DP is fragile and a pain:
I've been working in differential privacy for the past 5 years; given that it doesn't come up often unprompted, I was surprised and pleased by this question.
Short answer: no, generalizability does not imply differential privacy, although differential privacy does imply generalizability, to a large degree.
The simplest reason for this is that DP is a property that holds for all possible datasets, so if there is even one pathological data set for which your algorithm overfits, it's not DP, but you can still credibly say that it generalizes.
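For reference, that universal quantifier is explicit in the standard definition: a randomized mechanism $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D, D'$ differing in a single record and every set of outputs $S$,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[M(D') \in S].$$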
(I have to go, so I'm posting this as is, and I will add more later if you're interested.)
I'm having difficulties getting my head around the intended properties of the "implicitly" modal.
The way I've heard the two generals problem, the lesson is that it's unsolvable in theory, but approximately (to an arbitrary degree of approximation) solvable in practice (e.g. via message confirmations, probabilistic reasoning, and optimistic strategies), especially because channel reliability can be improved through protocols (at the cost of latency).
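As a toy illustration of the "arbitrary degree of approximation" point (the channel model here is made up purely for the sketch, and it ignores the mutual-knowledge aspect that makes the exact problem unsolvable): if each message is lost independently with probability $p$, then sending $n$ copies fails to get anything through with probability $p^n$, which you can make as small as you like but never zero.

```python
import random

def any_copy_delivered(loss_prob: float, copies: int, rng: random.Random) -> bool:
    # Toy channel: each copy is lost independently with probability loss_prob;
    # delivery succeeds if at least one copy arrives.
    return any(rng.random() >= loss_prob for _ in range(copies))

rng = random.Random(0)
trials = 100_000
failures = sum(not any_copy_delivered(0.5, 10, rng) for _ in range(trials))
# Failure probability shrinks geometrically in the number of copies (0.5**10
# here), but no finite number of messages drives it all the way to zero.
print(f"empirical: {failures / trials:.5f}   theoretical: {0.5 ** 10:.5f}")
```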
I also think that taking literally a statement like "LessWrong curated posts help to establish common knowledge" is outright wrong; instead, there's an implied transformation to "LessWrong curated posts hel...
I think this is a challenge of different definitions. To me, what "adaptation" and "problem" mean requires that every problem be a failure of adaptation. Otherwise it wouldn't be a problem!
This was poor wording on my part; I think there's both a narrow sense of "adaptation" and a broader sense in play, and I mistakenly invoked the narrow sense to disagree. Like, continuing with the convenient fictional example of an at-birth dopamine set-point, the body cannot adapt to increase the set-point, but this is qualitatively different than a set-point that's cont...
This is actually confounded when using ADHD as an example because there are two dynamics at play:
I did start with "I agree 90%."
I raised ADHD because it was the first thing that popped into my mind where a chemical habit feels internally aligned, such that the narrative of the "addiction" reducing slack rang hollow.
And, quite tidily, ADHD is one of the primary reasons I learned to develop slack.
ADHD is basically an extreme version of slack philosophy hardwired into your brain.
That has not actually been my experience, but I get the sense that my ADHD is much milder than yours. I also get the sense that your experience w.r.t. ADHD and slack is really co...
I'm a software engineer, and I'm not worried about AI taking my job. The shortest explanation of this is that "coding" is a very small part of what I do: there's stuff that's more product-related, and stuff that's pre-paradigmatic (to stretch the term), and mentorship, and communication; when I do write code, a lot of the time the code itself is almost irrelevant compared to the concerns of integrating it into the larger system and making it easy to change or delete in the future when requirements change.
One stupid analogy here is that coding is like walking--important and potentially impressive if a machine does it, but insufficient on its own to actually replace a human in a job.
“everything is psychology; nothing is neurology”
This line confuses me.
It was just a handle that came to mind for the concept that I'm trying to warn against. Reading your post I get a sense that it's implicitly claiming that everything is mutable and nothing is fixed; eh... that's not right either. Like, it feels like it implicitly and automatically rejects that something like a coffee habit can be the correct move even if you look several levels up.
I think maybe you're saying that someone can choose to reach for coffee for reasons other than wakefulness o
A system for recognizing when things are helping and hurting
Do you have a particular way you recommend to measure and track mental effects? I have not been able to find something that is sufficiently sticky and sufficiently informative and sufficiently easy.
Some potentially naive thoughts/questions:
When I grab a cup of coffee for a pick-me-up, I'm basically asserting that I should have more energy than I do right now.
90% agree with the overall essay, but I’ll pick on this point. It seems you’re saying “everything is psychology; nothing is neurology”, which is a sometimes useful frame but has its flaws. As an example, ADHD exists, and for someone with it to a significant degree, there is a real lack of slack (e.g. inability to engage in long-term preparations that require consistent boring effort, brought about by chronically low dopamine), and ...
So, I have an internal sense that I have overcome "idea scarcity", as a result of systematized creativity practice (mostly related to TRIZ), and I have a suspicion that this is both learnable and useful (as a complement to the domain-specific approach of "read a lot about the SOTA of alignment"), but I don't know how useful; do you have a sense that this particular problem is a bottleneck in alignment?
I can imagine a few ways this might be the case:
I feel like we're talking past each other. I'm trying to point out the difficulty of "simply go with what values you have and solve the edge cases according to your values" as a learning problem: it is too high-dimensional, and you need too many case labels; part of the idea of the OP is to reduce the number of training cases required, and my question/suspicion is that it doesn't really help outside of the "easy" stuff.
I think I disagree with large parts of that post, but even if I didn't, I'm asking something slightly different. Philosophical conservativism seems to be asking "how do we get the right behaviors at the edges?" I'm asking "how do we get any particular behavior at the edges?"
One answer may be "You can't, you need philosophical conservativism", but I don't buy that. It seems to me that a "constitution", i.e. pure deontology, is a potential answer so long as the exponentially many interactions of the principles are analyzed (I don't think it's a good answer, ...
IIUC, there are two noteworthy limitations of this line of work:
Did I understand correctly?
(I have only skimmed the paper)
This may be a naive question, but can this approach generalize or otherwise apply to concepts that don't have such a nice structure for unsupervised learning (or would that no longer be sufficiently similar)? I'm imagining something like the following:
A question about alignment via natural abstractions (if you've addressed it before, please refer me to where): it seems to me plausible that natural abstractions exist but are not useful for alignment, because alignment is a high-dimensional all-or-nothing property. Like, the AI will learn about "trees", but whether it avoids unintentionally killing everyone may depend on whether a palm tree counts as a tree, or on whether a copse counts as full of trees, or some other question that depends on unnatural details of the natural abstraction.
In that case, I don't see why the problem of "system alignment" or "supervisor alignment" is any simpler or easier than "supervisee alignment".
This is cute, but I have strong qualms with your 3rd prediction; I don't disagree, per se, but
So I want to double check: what counts as a variant and what doesn't?
I had an insight about the implications of NAH which I believe is useful to communicate if true and useful to dispel if false; I don't think it has been explicitly mentioned before.
One of Eliezer's examples is "The AI must be able to make a cellularly identical but not molecularly identical duplicate of a strawberry." One of the difficulties is explaining to the AI what that means. This is a problem with communicating across different ontologies--the AI sees the world completely differently than we do. If NAH in a strong sense is true, then this problem go...