All of rpglover64's Comments + Replies

I had an insight about the implications of NAH which I believe is useful to communicate if true and useful to dispel if false; I don't think it has been explicitly mentioned before.

One of Eliezer's examples is "The AI must be able to make a cellularly identical but not molecularly identical duplicate of a strawberry." One of the difficulties is explaining to the AI what that means. This is a problem with communicating across different ontologies--the AI sees the world completely differently than we do. If NAH in a strong sense is true, then this problem go... (read more)

Basically this. It has other directions, but I do think the NAH is trying to investigate how hard translating between ontologies is as capabilities scale up.
That seems included in the argument of this section, yes.

(Why "Top 3" instead of "literally the top priority?" Well, I do think a successful AGI lab also needs to have top-quality researchers, and other forms of operational excellence beyond the ones this post focuses on. You only get one top priority, )


I think the situation is more dire than this post suggests, mostly because "You only get one top priority." If your top priority is anything other than this kind of organizational adequacy, it will take precedence too often; if your top priority is organizational adequacy, you probably can't get off the ground... (read more)

Oh thanks, I was looking for that twitter thread and forgot who the author was. I was struggling in the OP to figure out how to integrate this advice. I agree with the Dan Luu thread. I do... nonetheless see orgs successfully doing multiple things. I think my current belief is that you only get one top priority to communicate to your employees, but that a small leadership team can afford to have multiple priorities (but they should think of anything not in their top-5 as basically sort of abandoned, and anything not in their top-3 as 'very at risk of getting abandoned'). I also don't necessarily think "priority" is quite the right word for what needs happening here. I'll think on this a bit more and maybe rewrite the post.

I just read an article that reminded me of this post. The relevant section starts with "Bender and Manning’s biggest disagreement is over how meaning is created". Bender's position seems to have some similarities with the thesis you present here, especially when viewed in contrast to what Manning claims is the currently more popular position that meaning can arise purely from distributional properties of language.

This got me wondering: if Bender is correct, then there is a fundamental limitation in how well (pure) language models can understand the world; ... (read more)

Well, obviously, there's a huge problem right now with LLMs having no truth-grounding, IE not being able to distinguish between making stuff up vs trying to figure things out. I think that's a direct consequence of only having a 'correlational' picture (IE the 'Manning' view).

Interesting. I'm reminded of this definition of "beauty".

Interesting comparison! To spell it out a little, Silmer's thesis is that desire more or less covers the simple notion of subjective beauty, IE, liking what you see. But when a second player enters the game, optimizing for desirability, things get much more interesting; this phenomenon has its own special indicators, such as bright colors and symmetry. Often, "beauty" is much more about this special phenomenon. My thesis is that mutual information captures a simple, practical notion of "aboutness"; but optimizing for mutual information carries its own special signature, such as language (IE, codified mappings). Often, "aboutness" is much more about this special phenomenon.
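To make the "mutual information" half of this comparison concrete, here is a minimal sketch of computing mutual information from a small joint distribution. The function name and the toy tables are illustrative assumptions, not anything from the original post; the point is only that a codified one-to-one mapping (a crude stand-in for "language") maximizes the quantity, while an unrelated signal scores zero.

```python
import math

def mutual_information(joint):
    """Return I(X;Y) in bits, where joint[x][y] = P(X=x, Y=y).

    Illustrative helper (not from the original post); assumes the
    entries are non-negative and sum to 1.
    """
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # 0 * log(0) is taken as 0
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

# A signal that is a codified one-to-one mapping of the world state.
codified = [[0.5, 0.0],
            [0.0, 0.5]]

# A signal that is statistically independent of the world state.
unrelated = [[0.25, 0.25],
             [0.25, 0.25]]

print(mutual_information(codified))
print(mutual_information(unrelated))
```

The codified table carries one full bit about the world state and the unrelated one carries none, which is the "simple, practical" sense of aboutness; the sketch says nothing about the optimization dynamics the comment is really pointing at.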

One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT3); the former has no pretenses of doing only next token prediction.

2 Bill Benzon 3mo
But the core LLM is pretty much the same, no? It doesn't have some special sauce that allows it to act differently.

But they seem like they are only doing part of the "intelligence thing".

I want to be careful here; there is some evidence to suggest that they are doing (or at least capable of doing) a huge portion of the "intelligence thing", including planning, induction, and search, and even more if you include minor external capabilities like storage.

I don't know if anyone else has spoken about this, but since thinking about LLMs a little I am starting to feel like there's something analogous to a small LLM (SLM?) embedded somewhere as a component in humans

I know... (read more)

Some thoughts:

  • Those who expect fast takeoffs would see the sub-human phase as a blip on the radar on the way to super-human
  • The model you describe is presumably a specialist model (if it were generalist and capable of super-human biology, it would plausibly count as super-human; if it were not capable of super-human biology, it would not be very useful for the purpose you describe). In this case, the source of the risk is better thought of as the actors operating the model and the weapons produced; the AI is just a tool
  • Super-human AI is a particularly s
... (read more)

I think "sufficiently" is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?

I also don't think "something in the middle" is the right characterization; I think "something else" is more accurate. I think that the failure you're pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn't really present in either part.

I also think that "cyborg alignment" is in many ways a much more tractable proble... (read more)

Like, I may not want to become a cyborg if I stop being me, but that's a separate concern from whether it's bad for alignment (if the resulting cyborg is still aligned).

5 David Scott Krueger (formerly: capybaralet) 4mo
Oh I see. I was getting at the "it's not aligned" bit. Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:

  • I'm in control
  • The machine part is in control
  • Something in the middle

Only the first one seems likely to be sufficiently aligned.

OpenAI’s focus with doing these kinds of augmentations is very much “fixing bugs” with how GPT behaves: Keep GPT on task, prevent GPT from making obvious mistakes, and stop GPT from producing controversial or objectionable content. Notice that these are all things that GPT is very poorly suited for, but humans find quite easy (when they want to). OpenAI is forced to do these things, because as a public facing company they have to avoid disastrous headlines like, for example: Racist AI writes manifesto denying holocaust.[7]

As alignment researchers

... (read more)

I think that's an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with in some sense passive AI) is not an alignment risk (any more than any technology is). My worry is the "dual use technology" section.

2 David Scott Krueger (formerly: capybaralet) 4mo
I don't understand what you're getting at RE "personal level".

simple utility functions are fundamentally incapable of capturing the complexity/subtleties/nuance of human preferences.

No objections there.

that inexploitable agents exist that do not have preferences representable by simple/unitary utility functions.


Going further, I think utility functions are anti-natural to generally capable optimisers in the real world.

I tentatively agree.

That said

The existence of a utility function is a sometimes useful simplifying assumption, in a way similar to how logical omniscience is (or should we be doing all mat... (read more)

If shard theory is right, the utility functions of the different shards are weighted differently in different contexts. The relevant criterion is not pareto optimality wrt a set of utility functions/a vector valued utility function. Or rather pareto optimality will still be a constraint, but the utility function needs to be defined over agent/environment state in order to accord for the context sensitivity.

The base models of GPT-3 already have the ability to "follow instructions", it's just veiled behind the more general interface. [...] you can see how it contains this capability somewhere.

This is a good point that I forgot. My mental model of this is that since many training samples are Q&A, in these cases, learning to complete implies learning how to answer.

InstructGPT also has the value of not needing the wrapper of "Q: [] A: []", but that's not really a qualitative difference.

I want to push back a little bit on the claim that this is not a qu... (read more)

That's fair - I meant mainly on the abstract view where you think of the distribution that the model is simulating []. It doesn't take a qualitative shift either in terms of the model being a simulator, nor a large shift in terms of the distribution itself. My point is mainly that instruction following is still well within the realm of a simulator - InstructGPT isn't following instructions at the model-level, it's instantiating simulacra that respond to the instructions. Which is why prompt engineering still works with those models. Yeah. Prompts serve the purpose of telling GPT what world specifically it's in on its learned distribution over worlds, and what processes it's meant to be simulating. It's zeroing in on the right simulacra, or "mental state" as you put it (though it's at a higher level of abstraction than the model itself, being a simulated process, hence why simulacra evokes a more precise image to me). The way I think of it is that with fine-tuning, you're changing the learned distribution (both in terms of shifting it and narrowing/collapsing it) to make certain simulacra much more accessible - even without additional information from the prompt to tell the model what kind of setting it's in, the distribution can be shifted to make instruction-following simulacra much more heavily represented. As stated above, prompts generally give information to the model on what part of the learned prior the setting is in, so soft prompts are giving maximal information in the prompt to the model on what part of the prior to collapse probability onto. For stronger fine-tuning I would expect needing to pack more information into the prompt. Take for example the case where you want to fine-tune GPT to write film scripts. You can do this with base GPT models too, it'd just be a lot harder because the simulacra you want (film writers) aren't as easily accessible as in a fine-tuned model where those are the

Other people have given good answers to the main question, but I want to add just a little more context about self-modifying code.

A bunch of MIRI's early work explored the difficulties of the interaction of "rationality" (including utility functions induced by consistent preferences) with "self-modification" or "self-improvement"; a good example is this paper. They pointed out some major challenges that come up when an agent tries to reason about what future versions of itself will do; this is particularly important because one failure mode of AI alignment... (read more)

I disagree []; simple utility functions are fundamentally incapable of capturing the complexity/subtleties/nuance of human preferences. I agree with shard theory that "human values are contextual influences on human decision making". If you claim that deviations from a utility function are irrational, by what standard do you make that judgment? John Wentworth showed in "Why Subagents? []" that inexploitable agents exist that do not have preferences representable by simple/unitary utility functions. Going further, I think utility functions are anti-natural to generally capable optimisers in the real world. I suspect that desires for novelty/a sense of boredom (which contribute to the path dependence of human values) or similar mechanisms are necessary to promote sufficient exploration in the real world (though some RL algorithms explore in order to maximise their expected return, so I'm not claiming that EU maximisation does not allow exploration, more that embedded agents in the real world are limited in effectively exploring without inherent drives for it).

(I may promote this to a full question)

Do we actually know what's happening when you take an LLM trained on token prediction and fine-tune it via e.g. RLHF to get something like InstructGPT or ChatGPT? The more I think about the phenomenon, the more confused I feel.

Here is a short overview: []
Do please promote to a full question; I also want to know the answer.

English doesn’t have great words for me to describe what I mean here, but it’s something like: your visualization machinery says that it sees no obstacle to success, such that you anticipate either success or getting a very concrete lesson.

One piece of advice/strategy I've received that's in this vein is "maximize return on failure". So prefer to fail in ways that you learn a lot, and to fail quickly, cheaply, and conclusively, and produce positive externalities from failure. This is not so much a good search strategy but a good guiding principle and selection heuristic.

This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I've been thinking about where people seem to conflate utility-optimizing agents with policy-executing agents, but the two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍.

That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers--agents that do things like planning or search to maximize a utility function; but plannin... (read more)

FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to "argmax over crisp human-specified utility function." (In the language of the OP, I expect values-executors, not grader-optimizers.) I'm not either. I think there will be phase changes wrt "shard strengths" (keeping in mind this is a leaky abstraction), and this is a key source of danger IMO. Basically my stance is "yeah there are going to be phase changes, but there are also many perturbations which don't induce phase changes, and I really want to understand which is which."

My working model for unintended capabilities improvement focuses more on second- and third-order effects: people see promise of more capabilities and invest more money (e.g. found startups, launch AI-based products), which increases the need for competitive advantage, which pushes people to search for more capabilities (loosely defined). There is also the direct improvement to the inner research loop, but this is less immediate for most work.

Under my model, basically any technical work that someone could look at and think "Yes, this will help me (or my pro... (read more)

I don't think "definitions" are the crux of my discomfort. Suppose the model learns a cluster; the position, scale, and shape parameters of this cluster summary are not perfectly stable--that is, they vary somewhat with different training data. This is not a problem on its own, because it's still basically the same; however, the (fuzzy) boundary of the cluster is large (I have a vague intuition that the curse of dimensionality is relevant here, but nothing solid). This means that there are many cutting planes, induced by actions to be taken downstream of t... (read more)

This seems like a useful special case of "conditions-consequences" reasoning. I wonder whether

  • Avoiding meddling is a useful subskill in this context (probably not)
  • There is another useful special case

A good example of this is the Collatz conjecture. It has a stupendous amount of evidence in the finite data points regime, but no mathematician worth their salt would declare Collatz solved, since it needs to have an actual proof. 
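As a small illustration of what "evidence in the finite data points regime" looks like, here is a sketch that checks the Collatz iteration for every starting value up to a bound. The function names and the step cap are assumptions made for the example; a run like this only ever confirms finitely many cases and so is not a proof.

```python
def collatz_reaches_one(n: int, max_steps: int = 10_000) -> bool:
    """Return True if iterating the Collatz map from n hits 1 within max_steps.

    max_steps is an arbitrary safety cap chosen for this sketch so the
    loop always terminates; returning False means "not seen within the
    cap", not "counterexample found".
    """
    steps = 0
    while n != 1 and steps < max_steps:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return n == 1

def all_reach_one(limit: int) -> bool:
    """Check every starting value from 1 to limit inclusive."""
    return all(collatz_reaches_one(n) for n in range(1, limit + 1))

print(all_reach_one(10_000))
```

However large `limit` is made, the check covers a finite prefix of the naturals, which is exactly why it counts as evidence rather than proof.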

It's important to distinguish the probability you'd get from a naive induction argument and a more credible one that takes into account the reference class of similar mathematical statements that hold until large but finite limits but may or may not hold for all naturals.

Similarly, P=NP is another problem with vast evidence,

... (read more)
I agree that logical/mathematical proofs are more analogous to functions than probabilities, but they don't have to be only computable functions, even if it's all we will ever access, and that actually matters, especially in non-constructive math and proofs. I also realized why probability 0 and 1 aren't enough for proof, or equivalently why Abram Demski's observation that a proof is stronger than an infinite number of observations is correct. And it essentially boils down to the fact that in infinite sets, there are certain scenarios where the probability of an outcome is 0, but it's still possible to get that outcome, or equivalently the probability is 1, but that doesn't mean that it doesn't have counterexamples. The best example is throwing a dart at a diagonal corner has probability 0, yet it still can happen. This doesn't happen in finite sets, because a probability of 0 or 1 implicitly means that you have all your sample points, and for probability 0 means that it's impossible to do, because you have no counterexamples, and for probability 1 you have a certainty proof, because you have no counterexamples. Mathematical conjectures and proofs usually demand something stronger than that of probability in infinite sets: A property of a set can't hold at all, and there are no members of a set where that property holds, or a property of a set always holds, and the set of counterexamples is empty. Unfortunately, infinite sets are the rule, not the exception in mathematics, and this is still true of even uncomputably large sets, like the arithmetical hierarchy of halting oracles. Finite sets are rare in mathematics, especially for proof and conjecture purposes. Here's a link on where I got the insight from: []

I feel like this is a difference between "almost surely" and "surely", both of which are typically expressed as "probability 1", but which are qualitatively different. I'm wondering whether infinitesimals would actually work to represent "almost surely" as  (as suggested in this post).

Also a nitpick and a bit of a tangent, but in some cases, a mathematician will accept any probability > 0 as proof; probabilistic proof is a common tool for non-constructive proof of existence, especially in combinatorics (although the way I've seen it, it's u... (read more)

Not missing the joke, just engaging with a different facet of the post.

should I do something like rewriting the main post to mention the excellent answers I had, or can I click on some « accept this answer » button somewhere?

I don't think there's an "accept answer" button, and I don't think you're expected to update your question. I personally would probably edit it to add one sentence summarizing your takeaways.

I don't think it would work to slow down AI capabilities progress. The reason is that AI capabilities translate into money in a way that's much more direct than "science" writ large--they're a lot closer to engineering.

Put differently, if it could have worked (and before GPT-2 and the surrounding hype, I might have believed it) it's too late now.

I think you are missing the joke, Szilard was probably describing a landscape very much similar to the extant one
It might depend on whether or not radically new paradigms are needed to get to true AGI or whether just scaling up the existing tech is enough. If scaling up the existing tech isn't enough such a project could focus all the money on transformers and their applications while shutting down the pursuit of radically new paradigms. 

western philosophy has a powerful anti-skepticism strain, to the point where "you can know something" is almost axiomatic

I'm pretty pessimistic about the strain of philosophy as you've described it. I have yet to run into a sense of "know" that is binary (i.e. not "believed with probability") that I would accept as an accurate description of the phenomenon of "knowledge" in the real world rather than as an occasionally useful approximation. Between the preface paradox (or its minor modification, the lottery paradox) and Fitch's paradox of knowability, I do... (read more)

In my limited experience, it feels like a lot of epistemologists have sadly "missed the bus" on this one. Like, they've gone so far down the wrong track that it's a lot of work to even explain how our way of thinking about it could be relevant to their area of concern. 

My response would be that this unfairly (and even absurdly) maligns "theory"!

I agree. However, the way I've had the two generals problem framed to me, it's not a solution unless it guarantees successful coordination. Like, if I claim to solve the halting problem because in practice I can tell if a program halts, most of the time at least, I'm misunderstanding the problem statement. I think that conflating "approximately solves the 2GP" with "solves the 2GP" is roughly as malign as my claim that the approximate solution is not the realm of theory.

Some peopl

... (read more)
I think this is very fair (and I will think about editing my post in response).

You're saying that transformers are key to alignment research?

I would imagine that latent space exploration and explanation is a useful part of interpretability, and developing techniques that work for both language and images improves the chance that the techniques will generalize to new neural architectures.

getting both differential privacy and capabilities pushes non-differentially-private capabilities more, usually, I think, or something

I don't think it does in general, and every case I can think of right now did not, but I agree that it is a worthwhile thing to worry about.

tools [for finding DP results] I'd recommend include

I'd add clicking through citations and references on arxiv and looking at the litmap explorer in arxiv.

As mentioned in the other reply, DP gives up performance (though with enough data you can overcome that, and in many cases, you'd need only a little less data for reliable answers anyway).

Another point is that DP is fragile and a pain:

  • you have to carefully track the provenance of your data (so you don't accidentally include someone's data twice without explicitly accounting for it)
  • you usually need to clip data to a priori bounds or something equivalent (IIRC, the standard DP algorithm for training NNs requires gradient clipping)
  • you can "run out of budget"-
... (read more)
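The clipping requirement in the list above can be sketched as follows. This is a minimal illustration in the spirit of DP-SGD-style training (clip each example's gradient to a fixed L2 norm, sum, add Gaussian noise, average); the function names, the plain-list gradient representation, and the parameter names are assumptions for the example, and the sketch does no privacy accounting, so it does not by itself give any particular privacy guarantee or track the "budget".

```python
import math
import random

def clip_by_l2_norm(grad, max_norm):
    """Scale grad down so its L2 norm is at most max_norm (hypothetical helper)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def noisy_clipped_mean(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Clipping bounds how much any single example can move the sum, which
    is what lets the noise scale be set a priori as
    noise_multiplier * max_norm.
    """
    clipped = [clip_by_l2_norm(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]   # toy per-example gradients
print(noisy_clipped_mean(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Note that the first toy gradient is shrunk from norm 5 to norm 1 while the second is left alone, so clipping changes the direction of the averaged update; that is one concrete source of the performance cost mentioned above.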
Thanks, this kind of inside knowledge from practice is both precious and the hardest to find. I also like the review very much: yes it may not be exactly what I wanted, but it feels like exactly what I should have wanted, in that it helped me realize that each and every one of the numerous DP variants discussed implicates a slightly different notion of generalizability. In retrospect, my last question was a bit like « Ice-cream is good, DP is good, so why not use DP to improve ice-cream? » => because this false logic stops working as soon as « good » is properly defined. I need some time to catch up with the excellent reading list below and try a few codes myself, but will keep your name in mind if I have more technical questions. In the meantime, and as I’m not very familiar with the habits here: should I do something like rewriting the main post to mention the excellent answers I had, or can I click on some « accept this answer » button somewhere?

I've been working in differential privacy for the past 5 years; given that it doesn't come up often unprompted, I was surprised and pleased by this question.

Short answer: no, generalizability does not imply differential privacy, although differential privacy does imply generalizability, to a large degree.

The simplest reason for this is that DP is a property that holds for all possible datasets, so if there is even one pathological data set for which your algorithm overfits, it's not DP, but you can still credibly say that it generalizes.

(I have to go, so I'm posting this as is, and I will add more later if you're interested.)

Thanks and yes please add! For example, if DP implies generalization, then why isn’t everyone trying to complete backprop with DP principles to make a (more robust)/(better at privacy) learning strategy? Or that’s what everyone tried but it’s trickier than it seems?

I'm having difficulties getting my head around the intended properties of the "implicitly" modal.

  • Could you give an example of  where ; that is,  is implicit but not explicit?
  • Am I correct in understanding that there is a context attached to the box and the turnstile that captures the observer's state of knowledge?
  • Is the "implicitly" modal the same as any other better-known modal?
  • Is the primary function of the modal to distinguish "stuff known from context or inferred using context" from "stuff explicitly assumed or derived"?

The way I've heard the two generals problem, the lesson is that it's unsolvable in theory, but approximately (to an arbitrary degree of approximation) solvable in practice (e.g. via message confirmations, probabilistic reasoning, and optimistic strategies), especially because channel reliability can be improved through protocols (at the cost of latency).
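The "approximately solvable in practice" half of this can be illustrated with a little arithmetic. The sketch below assumes each messenger is lost independently with the same probability (an assumption of the example, not something the problem statement grants), and the function names are hypothetical. It shows only that delivery of a single message can be made arbitrarily likely by redundancy; it does not produce common knowledge, so the theoretical impossibility result is untouched.

```python
def prob_at_least_one_arrives(loss_prob: float, messengers: int) -> float:
    """P(at least one of `messengers` copies gets through).

    Assumes independent losses with identical probability loss_prob.
    """
    return 1.0 - loss_prob ** messengers

def messengers_needed(loss_prob: float, target: float) -> int:
    """Smallest number of copies so that delivery probability >= target."""
    if not (0.0 <= loss_prob < 1.0) or not (0.0 <= target < 1.0):
        raise ValueError("need 0 <= loss_prob < 1 and 0 <= target < 1")
    n = 1
    while prob_at_least_one_arrives(loss_prob, n) < target:
        n += 1
    return n

# With a channel that loses half of all messengers:
print(prob_at_least_one_arrives(0.5, 3))   # 0.875
print(messengers_needed(0.5, 0.99))        # 7
```

Requiring `target < 1` is the code-level echo of the theory: no finite number of messengers reaches certainty, only an arbitrarily good approximation, and each extra round of redundancy or confirmation costs latency.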

I also think that taking literally a statement like "LessWrong curated posts help to establish common knowledge" is outright wrong; instead, there's an implied transformation to "LessWrong curated posts hel... (read more)

My response would be that this unfairly (and even absurdly) maligns "theory"! The "theory" here seems like a blatant straw-man, since it pretends probabilistic reasoning and cost-benefit tradeoffs are impossible.  Let me sketch the situation you're describing as I understand it. Some people (as I understand it, core LessWrong staff, although I didn't go find a reference) justify some things in terms of common knowledge.  You either think that they were not intending this literally, or at least, that no one else should take them literally, and instead should understand "common knowledge" to mean something informal (which you yourself admit you're somewhat unclear on the precise meaning of). My problem with this is that it creates a missing stair [] kind of issue. There's the people "in the know" who understand how to walk carefully on the dark stairway, but there's also a class of "newcomers" who are liable to fall. (Where "fall" here means, take all the talk of "common knowledge" literally.) I would think this situation excusable if "common knowledge" were a very useful metaphor with no brief substitute that better conveys what's really going on; but you yourself suggest "commonness of knowledge" as such a substitute.  Just want to flag the philosopher's defense here, which is that you can have a finite concept of p-common knowledge, which merely implies any finite level of iteration you need, without actually requiring all of those implications to be taken. (Granted, I can see having some questions about this.) Whereas the explicit theory of common knowledge contradicts this comparative idea, and instead asserts that common knowledge is an absolute yes-or-no thing, such that finite-level approximations fall far short of the real thing. This idea is illustrated with the electronic messaging example, which purports to show that any number of levels of finite iteration are as good as no communication at all. 

I think this is a challenge of different definitions. To me, what "adaptation" and "problem" mean requires that every problem be a failure of adaptation. Otherwise it wouldn't be a problem!

This was poor wording on my part; I think there's both a narrow sense of "adaptation" and a broader sense in play, and I mistakenly invoked the narrow sense to disagree. Like, continuing with the convenient fictional example of an at-birth dopamine set-point, the body cannot adapt to increase the set-point, but this is qualitatively different than a set-point that's cont... (read more)

This is actually confounded when using ADHD as an example because there are two dynamics at play:

  • Any "disability" (construed broadly, under the social model of disability) is, almost by definition, a case where your adaptive capacity is lower than expected (by society)
  • ADHD specifically affects executive function and impulse control, leading to a reduced ability to force, or do anything that isn't basically effortless.

I did start with "I agree 90%."

I raised ADHD because it was the first thing that popped into my mind where a chemical habit feels internally aligned, such that the narrative of the "addiction" reducing slack rang hollow.

And, quite tidily, ADHD is one of the primary reasons I learned to develop slack.

ADHD is basically an extreme version of slack philosophy hardwired into your brain.

That has not actually been my experience, but I get the sense that my ADHD is much milder than yours. I also get the sense that your experience w.r.t. ADHD and slack is really co... (read more)

I'm a software engineer, and I'm not worried about AI taking my job. The shortest explanation of this is that "coding" is a very small part of what I do: there's stuff that's more product-related, and stuff that's pre-paradigmatic (to stretch the term), and mentorship, and communication; when I do write code, a lot of the time the code itself is almost irrelevant compared to the concerns of integrating it into the larger system and making it easy to change or delete in the future when requirements change.

One stupid analogy here is that coding is like walking--important and potentially impressive if a machine does it, but insufficient on its own to actually replace a human in a job.

“everything is psychology; nothing is neurology”

this line confuses me.

It was just a handle that came to mind for the concept that I'm trying to warn against. Reading your post I get a sense that it's implicitly claiming that everything is mutable and nothing is fixed; eh... that's not right either. Like, it feels like it implicitly and automatically rejects that something like a coffee habit can be the correct move even if you look several levels up.

I think maybe you're saying that someone can choose to reach for coffee for reasons other than wakefulness o

... (read more)
Ah. Got it. That's not what I mean whatsoever. I don't think it's a mistake to incur adaptive entropy. When it happens, it's because that's literally the best move the system in question (person, culture, whatever) can make, given its constraints. Like, incurring technical debt isn't a mistake. It's literally the best move available at the time given the constraints. There's no blame in my saying that whatsoever. It's just true. And, it's also true that technical debt incurs an ongoing cost. Again, no blame. It's just true. In the same way (and really, as a generalization), incurring adaptive entropy always incurs a cost. That doesn't make it wrong to do. It's just true.   I think this is a challenge of different definitions. To me, what "adaptation" and "problem" mean requires that every problem be a failure of adaptation. Otherwise it wouldn't be a problem! I'm getting the impression that questions of blame or screw-up or making mistakes are crawling into several discussion points in these comments. Those questions are so far removed from my frame of thinking here that I just flat-out forgot to orient to them. They just don't have anything to do with what I'm talking about. So when I say something like "a failure of adaptation", I'm talking about a fact. No blame, no "should". The way bacteria fail to adapt to bleach. Just a fact. Everything we're inclined to call a "problem" is an encounter with a need to adapt that we haven't yet adapted to. That's what a problem is, to me. So any persistent problem is literally the same thing as an encounter with limitations in our ability to adapt.   Cool, good to know. Thank you.   I don't follow this, sorry. I think I'd have to read those articles. I might later. For now, I'm just acknowledging that you've said… something here, but I'm not sure what you've said, so I don't have much to say in response just yet.

A system for recognizing when things are helping and hurting

Do you have a particular way you recommend to measure and track mental effects? I have not been able to find something that is sufficiently sticky and sufficiently informative and sufficiently easy.

Some potentially naive thoughts/questions:

  • At a cursory level, this seems closely related to Deep Double Descent, but you don't mention it, which I find surprising; did I pattern-match in error?
  • This also seems tangentially related to the single basin hypothesis
2 · Arthur Conmy · 5mo
You can observe "double descent" of test loss curves in the grokking setting, and there is "grokking" of test set performance as model dimension is increased, as this paper points out.
6 · Neel Nanda · 5mo
Idk, it might be related to double descent? I'm not that convinced. Firstly, IMO, the most interesting part of deep double descent is the model size wise/data wise descent, which totally doesn't apply here. They did also find epoch wise descent (different from data wise, because it's trained on the same data a bunch), which is more related, but looks like test loss going down, then going up again, then going down. You could argue that grokking has test loss going up, but since it starts at uniform test loss I think this doesn't count.

My guess is that the descent part of deep double descent illustrates some underlying competition between different circuits in the model, where some do memorisation and others do generalisation, and there's some similar competition and switching. Which is cool and interesting and somewhat related! And it totally wouldn't surprise me if there's some similar tension between hard to reach but simple and easy to reach but complex.

Re single basin, idk, I actually think it's a clear disproof of the single basin hypothesis (in this specific case - it could still easily be mostly true for other problems). Here, there's a solution to modular addition for each of the 56 frequencies! These solutions are analogous, but they are fairly far apart in model space, and definitely can't be bridged by just permuting the weights (e.g., the embedding picking up on cos(5x) vs cos(18x) is a totally different solution and set of weights, and would require significant non-linear shuffling around of parameters).
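To make the "one solution per frequency" point concrete, here is a minimal sketch, not the trained network from the work being discussed. It assumes the modulus is 113 (an assumption, chosen because (113 - 1) / 2 = 56 matches the 56 frequencies mentioned above), and the names `mod_add_via_frequency` and `embedding` are hypothetical. It shows that a single frequency k is enough to compute (a + b) mod p through the angle-addition identities, and that the cos features for k = 5 and k = 18 are orthogonal as vectors, even though both yield the same input-output behaviour.

```python
import math

P = 113  # assumed modulus: (P - 1) // 2 == 56, matching the "56 frequencies" above


def mod_add_via_frequency(a, b, k, p=P):
    """Compute (a + b) % p using only the single Fourier frequency k."""
    w = 2 * math.pi * k / p
    # "Embedding": represent each input by its cos and sin at frequency k.
    cos_a, sin_a = math.cos(w * a), math.sin(w * a)
    cos_b, sin_b = math.cos(w * b), math.sin(w * b)
    # Angle-addition identities combine them into cos(w(a+b)) and sin(w(a+b)).
    cos_ab = cos_a * cos_b - sin_a * sin_b
    sin_ab = sin_a * cos_b + cos_a * sin_b
    # "Unembedding": score candidate c by cos(w(a+b-c)), which reaches 1
    # only when c == (a + b) % p (p is prime and 0 < k < p).
    scores = [cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c) for c in range(p)]
    return max(range(p), key=scores.__getitem__)


def embedding(k, p=P):
    """The cos feature each input value gets under frequency k."""
    return [math.cos(2 * math.pi * k * x / p) for x in range(p)]


# Both frequencies solve the task on a sample of inputs...
for k in (5, 18):
    assert all(
        mod_add_via_frequency(a, b, k) == (a + b) % P
        for a in range(0, P, 7)
        for b in range(0, P, 5)
    )

# ...but their embeddings are orthogonal up to floating-point error.
dot = sum(x * y for x, y in zip(embedding(5), embedding(18)))
print(f"dot(cos(5x), cos(18x)) = {dot:.6f}")
```

This only illustrates why the per-frequency solutions are functionally analogous yet use different features; it says nothing by itself about how far apart the corresponding trained weights are or about paths between them in the loss landscape.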

When I grab a cup of coffee for a pick-me-up, I'm basically asserting that I should have more energy than I do right now.


90% agree with the overall essay, but I’ll pick on this point. It seems you’re saying “everything is psychology; nothing is neurology”, which is a sometimes useful frame but has its flaws. As an example, ADHD exists, and for someone with it to a significant degree, there is a real lack of slack (e.g. inability to engage in long-term preparations that require consistent boring effort, brought about by chronically low dopamine), and ...

I like the rest of your example, but this line confuses me. I don't think I'm saying this, I don't agree with the statement even if I somehow said it, and either way I don't see how it connects to what you're saying about ADHD.

I… agree? Actually, as I reread this, I'm not sure how this relates to what I was saying in the OP. I think maybe you're saying that someone can choose to reach for coffee for reasons other than wakefulness or energy control. Yes? If so, I'd say… sure, yeah. Although I don't know if this affects anything about the point I was making at all. Donning the adaptive entropy lens, the place my attention goes to is the "chronically low dopamine". Why is that? What prevents the body from adapting to its context? I know very little about the biochemistry of ADHD. But I do know lots of people who "have ADHD" who don't seem to have any problems anymore. Not because they've overcome the "condition" but because they treated it as a context for the kind of life they wanted to build and live. One of them runs a multi-million dollar company she built. So speaking from a pretty thorough ignorance of the topic itself, my guess based on my priors is that the problem-ness of ADHD has more to do with the combo of (a) taking in the culture's demand that you be functional in a very particular way combined with (b) a built-in incapability of functioning that way. So there we've got the imposition of a predetermined conclusion, in (a). But maybe I'm just way off in terms of how ADHD works. I don't know. My point was about the most common use of caffeine, and I think that holds just fine.

I hadn't read that post. Still haven't. But to the rest, yes. I agree. I'd just add that "the same strategy" can be extremely meta. It's stunning how much energy can go into trying "something new" in a pursuer/avoider dynamic in ways that just reenact the problem despite explicitly trying not to. The true true "trying harder" that doesn't do this doesn't feel to me l

So, I have an internal sense that I have overcome "idea scarcity", as a result of systematized creativity practice (mostly related to TRIZ), and I have a suspicion that this is both learnable and useful (as a complement to the domain-specific approach of "read a lot about the SOTA of alignment"), but I don't know how useful; do you have a sense that this particular problem is a bottleneck in alignment?

I can imagine a few ways this might be the case:

  • Junior researchers come up with one great idea and then burn out (where they might have been able to come up
...

I feel like we're talking past each other. I'm trying to point out the difficulty of "simply go with what values you have and solve the edge cases according to your values" as a learning problem: it is too high-dimensional, and you need too many case labels; part of the idea of the OP is to reduce the number of training cases required, and my question/suspicion is that it doesn't really help outside of the "easy" stuff.

Yeah, I think this might be a case where we misunderstood each other.

I think I disagree with large parts of that post, but even if I didn't, I'm asking something slightly different. Philosophical conservatism seems to be asking "how do we get the right behaviors at the edges?" I'm asking "how do we get any particular behavior at the edges?"

One answer may be "You can't, you need philosophical conservatism", but I don't buy that. It seems to me that a "constitution", i.e. pure deontology, is a potential answer so long as the exponentially many interactions of the principles are analyzed (I don't think it's a good answer, ...

Basically, we should use the assumption that is most robust to being wrong. It would be easier if there were objective, mind-independent rules of morality (the position called moral realism), but if that assumption is wrong, your solution can get manipulated. So in practice, we shouldn't base alignment plans on whether moral realism is correct. In other words, I'd simply go with what values you have and solve the edge cases according to your values.

IIUC, there are two noteworthy limitations of this line of work:

  • It is still fundamentally biased toward nonresponse, so if e.g. the steps to make a poison and an antidote are similar, it won't tell you the antidote for fear of misuse (this is necessary to avoid clever malicious prompts)
  • It doesn't give any confidence about behavior at edge cases (e.g. is it ethical to help plan an insurrection against an oppressive regime? Is it racist to give accurate information in a way that portrays some minority in a bad light?)

Did I understand correctly?

If I was handling the edge cases, I'd probably want the solution to be philosophically conservative. In this case the solution should not depend much on whether moral realism is correct or wrong.

(I have only skimmed the paper)

This may be a naive question, but can this approach generalize or otherwise apply to concepts that don't have such a nice structure for unsupervised learning (or would that no longer be sufficiently similar)? I'm imagining something like the following:

  • Start with a setup similar to the "Adversarial Training for High-Stakes Reliability" paper (the goal being to generate "non-violent" stories)
  • Via labeling, along with standard data augmentation techniques, classify a
...

A question about alignment via natural abstractions (if you've addressed it before, please refer me to where): it seems to me plausible that natural abstractions exist but are not useful for alignment, because alignment is a high-dimensional all-or-nothing property. Like, the AI will learn about "trees", but not unintentionally killing everyone depends on whether a palm tree is a tree, or on whether a copse counts as full of trees, or on some other question which depends on unnatural details of the natural abstraction.

  • Do you think that edge cases will just naturally be correctly learned?
  • Do you think that edge cases just won't end up mattering for alignment?
Definitions, as we usually use them, are not the correct data structure for word-meaning. Words point to clusters in thing-space; definitions try to carve up those clusters with something like cutting-planes. That's an unreliable and very lossy way to represent clusters, and can't handle edge-cases well or ambiguous cases at all. The natural abstractions are the clusters (more precisely the summary parameters of the clusters, like e.g. cluster mean and variance in a gaussian cluster model); they're not cutting-planes.
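As a rough illustration of the cluster vs cutting-plane distinction in the reply above, here is a toy sketch under heavy assumptions: a one-dimensional "thing-space" with a single made-up feature, two Gaussian clusters whose means and variances are invented for the example, and equal priors. The names `CLUSTERS`, `cluster_membership` and `definition` are hypothetical. The cluster summary gives a graded answer near the boundary, while the hard cutoff flips abruptly.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical clusters, each summarized only by (mean, variance) of one feature.
CLUSTERS = {"tree": (10.0, 4.0), "shrub": (2.0, 1.0)}


def cluster_membership(x):
    """Posterior probability of each cluster for x, assuming equal priors."""
    likelihoods = {name: gaussian_pdf(x, m, v) for name, (m, v) in CLUSTERS.items()}
    total = sum(likelihoods.values())
    return {name: value / total for name, value in likelihoods.items()}


def definition(x, cutoff=5.0):
    """A cutting-plane style definition: a hard threshold on the feature."""
    return "tree" if x > cutoff else "shrub"


for x in (2.0, 4.99, 5.01, 10.0):
    p_tree = cluster_membership(x)["tree"]
    print(f"x={x:5.2f}  definition={definition(x):5s}  P(tree)={p_tree:.3f}")
```

In this toy setup the two points on either side of the cutoff get opposite labels from `definition` but almost identical membership probabilities, which is the sense in which the cutting-plane is lossy at the edges. Whether real edge cases relevant to alignment behave like this is exactly what the question above is asking, and the sketch does not settle it.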

In that case, I don't see why the problem of "system alignment" or "supervisor alignment" is any simpler or easier than "supervisee alignment".

This is cute, but I have strong qualms about your 3rd prediction; I don't disagree, per se, but

  • Either "variants of this approach" is too broad to be useful, including things like safety by debate and training a weak AI to check the input
  • Or, if I take "variants" narrowly to mean using an AI to check its own inputs, my estimate is "basically zero"

So I want to double check: what counts as a variant and what doesn't?

I was using it rather broadly, considering situations where a smart AI is used to oversee another AI, and this is a key part of the approach. I wouldn't usually include safety by debate or input checking, though I might include safety by debate if there was a smart AI overseer of the process that was doing important interventions.