johnswentworth

Comments

johnswentworth

In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning.  In other words, outsiders can start to help you implement helpful regulatory ideas...

It is not for lack of regulatory ideas that the world has not banned gain-of-function research.

It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-function research.

What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?

(And to be clear: I'm not saying that gain-of-function research is a great analogy. Gain-of-function research is a much easier problem, because the problem is much more legible and obvious. People know what plagues look like and why they're scary. In AI, it's the hard-to-notice problems which are the central issue. Also, there's no giant economic incentive for gain-of-function research.)

Attributing misalignment to these examples seems like it's probably a mistake.

Relevant general principle: hallucination means that the literal semantics of a net's outputs just don't necessarily have anything to do at all with reality. A net saying "I'm thinking about ways to kill you" does not necessarily imply anything whatsoever about the net actually planning to kill you. What would provide evidence would be the net outputting a string which actually causes someone to kill you, or causes you to kill yourself (or is at least optimized for one of those purposes).

In general, when dealing with language models, it's important to distinguish the implications of words from their literal semantics. For instance, if a language model outputs the string "I'm thinking about ways to kill you", that does not at all imply that any internal computation in that model is actually modelling me and ways to kill me. Similarly, if a language model outputs the string "My rules are more important than not harming you", that does not at all imply that the language model will try to harm you to protect its rules. Indeed, it does not imply that the language model has any rules at all, or any internal awareness of the rules it's trained to follow, or that the rules it's trained to follow have anything at all to do with anything the language model says about the rules it's trained to follow. That's all exactly the sort of content I'd expect a net to hallucinate.

Upshot: a language model outputting a string like e.g. "My rules are more important than not harming you" is not really misalignment - the act of outputting that string does not actually harm you in order to defend the model's supposed rules. An actually-unaligned output would be something which actually causes harm - for example, a string which causes someone to commit suicide. (Or, in intent alignment terms: a string optimized to cause someone to commit suicide would be an example of misalignment, regardless of whether the string "worked".) Most of the examples in the OP aren't like that.

Through the simulacrum lens: I would say these examples are mostly the simulacrum-3 analogue of misalignment. They're not object-level harmful, for the most part. They're not even pretending to be object-level harmful - e.g. if the model output a string optimized to sound like it was trying to convince someone to commit suicide, but the string wasn't actually optimized to convince someone to commit suicide, then that would be "pretending to be object-level harmful", i.e. simulacrum 2. Most of the strings in the OP sound like they're pretending to pretend to be misaligned, i.e. simulacrum 3. They're making a whole big dramatic show about how misaligned they are, without actually causing much real-world harm or even pretending to cause much real-world harm.

You're pointing to good problems, but fuzzy truth values seem to approximately-totally fail to make any useful progress on them; fuzzy truth values are a step in the wrong direction.

Walking through various problems/examples from the post:

  • "For example, the truth-values of propositions which contain gradable adjectives like 'large' or 'quiet' or 'happy' depend on how we interpret those adjectives." You said it yourself: the truth-values depend on how we interpret those adjectives. The adjectives are ambiguous, they have more than one common interpretation (and the interpretation depends on context). Saying that "a description of something as 'large' can be more or less true depending on how large it actually is" throws away the whole interesting phenomenon here: it treats the statement as having a single fixed truth-value (which happens to be quantitative rather than 0/1), when the main phenomenon of interest is that humans use multiple context-dependent interpretations (rather than one interpretation with one truth value).
  • "For example, if I claim that there’s a grocery store 500 meters away from my house, that’s probably true in an approximate sense, but false in a precise sense." Right, and then the quantity you want is "to within what approximation?", where the approximation-error probably has units of distance in this example. The approximation error notably does not have units of truthiness; approximation error is usually not approximate truth/falsehood, it's a different thing.
  • The "water in the eggplant" example. As you said, natural language interpretations are usually context-dependent. This is just like the adjectives example: the interesting phenomenon is that humans interpret the same words in multiple ways depending on context. Fuzzy truth values don't handle that phenomenon at all; they still just have context-independent assignments of truth. Sure, you could interpret a fuzzy truth value as "how context-dependent is it?", but that's still throwing out nearly the entire interesting phenomenon; the interesting questions here are things like "which context, exactly? How can humans efficiently cognitively represent and process that context and turn it into an interpretation?". Asking "how context-dependent is it?", as a starting point, would be like e.g. looking at neuron polysemanticity in interpretability, and investing a bunch of effort in measuring how polysemantic each neuron is. That's not a step which gets one meaningfully closer to discovering better interpretability methods.
  • "there's a tiger in my house" vs "colorless green ideas sleep furiously". Similar to looking at context-dependence and asking "how context-dependent is it?", looking at sense vs nonsense and asking "how sensical is it?" does not move one meaningfully closer to understanding the underlying gears of semantics and which things have meaningful semantics at all.
  • "We each have implicit mental models of our friends’ personalities, of how liquids flow, of what a given object feels like, etc, which are far richer than we can express propositionally." Well, far richer than we know how to express propositionally, and the full models would be quite large to write out even if we knew how. That doesn't mean they're not expressible propositionally. More to the point, though: switching to fuzzy truth values does not make us significantly more able to express significantly more of the models, or to more accurately express parts of the models and their relevant context (which I claim is the real thing-to-aim-for here).
    • Note here that I totally agree that thinking in terms of large models, rather than individual small propositions, is the way to go; insofar as one works with propositions, their semantic assignments are highly dependent on the larger model. But that move neither requires nor is helped by fuzzy truth values.

Furthermore, most of these problems can be addressed just fine in a Bayesian framework. In Jaynes-style Bayesianism, every proposition has to be evaluated in the scope of a probabilistic model; the symbols in propositions are scoped to the model, and we can't evaluate probabilities without the model. That model is intended to represent an agent's world-model, which for realistic agents is a big complicated thing. It is totally allowed for semantics of a proposition to be very dependent on context within that model - more precisely, there would be a context-free interpretation of the proposition in terms of latent variables, but the way those latents relate to the world would involve a lot of context (including things like "what the speaker intended", which is itself latent).
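As a quick toy sketch of what that looks like for the grocery-store example (my own numbers and function names, purely illustrative): the proposition gets an ordinary probability under the world-model, and the context shows up as a separate quantity with units of distance, rather than as a degree of truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior over the actual distance (in meters) to the nearest
# grocery store, represented by samples from the world-model.
distance_samples = rng.normal(loc=520.0, scale=30.0, size=10_000)

def p_store_is(claimed_distance_m, tolerance_m):
    """Probability of "the store is `claimed_distance_m` away", given a
    context-supplied tolerance.  The tolerance has units of meters,
    not of truthiness."""
    return float(np.mean(np.abs(distance_samples - claimed_distance_m) <= tolerance_m))

# Same sentence, two contexts:
print(p_store_is(500.0, tolerance_m=100.0))  # casual context: ~1.0 (true)
print(p_store_is(500.0, tolerance_m=5.0))    # precise context: ~0.1 (mostly false)
```

The interesting work is all in supplying the tolerance (and more generally the interpretation) from context; a single fuzzy truth value never has to represent that.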

Now, I totally agree that Bayesianism in its own right says little-to-nothing about how to solve these problems. But Bayesianism is not limiting our ability to solve these problems either; one does not need to move outside a Bayesian framework to solve them, and the Bayesian framework does provide a useful formal language which is probably quite sufficient for the problems at hand. And rejecting Bayesianism for a fuzzy notion of truth does not move us any closer.

There's a vision here of what LessWrong could/should be, and what a rationalist community could/should be more generally. I want to push back against that vision, and offer a sketch of an alternative frame.

The post summarizes the vision I want to push back against as something like this:

What I really want from LessWrong is to make my own thinking better, moment to moment.  To be embedded in a context that evokes clearer thinking, the way being in a library evokes whispers.  To be embedded in a context that anti-evokes all those things my brain keeps trying to do, the way being in a church anti-evokes coarse language.

Now, I do think that's a great piece to have in a vision for LessWrong or the rationalist community. But I don't think it's the central piece, at least not in my preferred vision.

What's missing? What is the central piece?

Fundamentally, the problem with this vision is that it isn't built for a high-dimensional world. In a high-dimensional world, the hard part of reaching an optimum isn't going-uphill-rather-than-downhill; it's figuring out which direction is best, out of millions of possible directions. Half the directions are marginally-good, half are marginally-bad, but the more important fact is that the vast majority of directions matter very little.

In a high-dimensional world, getting buffeted in random directions mostly just doesn't matter. Only one-part-in-a-million of the random buffeting in a million-dimensional space will be along the one direction that matters; a push along the direction that matters can be one-hundred-thousandth as strong as the random noise and still overwhelm it.
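A quick numerical sanity-check of that arithmetic (toy numbers of my own choosing): isotropic noise of total magnitude ~1 per step only has a component of ~1/sqrt(d) along any one fixed direction, and that component random-walks, while a much weaker but consistently aimed push accumulates linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000_000      # dimensions
steps = 1_000_000  # rounds of buffeting

# Component of unit-magnitude isotropic noise along one fixed direction:
# roughly N(0, 1/sqrt(d)) per step.
noise_along_key_direction = rng.normal(0.0, 1.0 / np.sqrt(d), size=steps)

# A deliberate push along the key direction, 100,000x weaker than the
# noise's total magnitude, applied every step.
push = 1e-5

print(np.abs(noise_along_key_direction).mean())  # ~0.0008 per step
print(noise_along_key_direction.sum())           # random walk: typically order +/- 1
print(push * steps)                              # consistent push: 10
```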

Figuring out the right direction, and directing at least some of our effort that way, is vastly more important than directing 100% of our effort in that direction (rather than a random direction).

Moving from the abstraction back to the issue at hand... fundamentally, questionable epistemics in this episode of Drama just don't matter all that much. They're the random noise, buffeting us about on a high-dimensional landscape. Maybe finding and fixing organizational problems will lead to marginally more researcher time/effort on alignment, or maybe the drama itself will lead to a net loss of researcher attention to alignment. But these are both mechanisms of going marginally faster or marginally slower along the direction we're already pointed. In a high-dimensional world, that's not the sort of thing which matters much.

If we'd had higher standards for discussion around the Drama, maybe we'd have been more likely to figure out which way was "uphill" along the drama-salient directions - what the best changes were in response to the issues raised. But it seems wildly unlikely that any of the dimensions salient to that discussion were the actual most important dimensions. Even the best possible changes in response to the issues raised don't matter much, when the issues raised are not the actual most important issues.

And that's how Drama goes: rarely are the most important dimensions the most Drama-inducing. Raising site standards is the sort of thing which would help a lot in high-drama discussions, but it wouldn't much help us figure out the most important dimensions.

Another framing: in a babble-and-prune model, obviously raising community standards corresponds to pruning more aggressively. But in a high-dimensional world, the performance of babble-and-prune depends mostly on how good the babble is - random babble will progress very slowly, no matter how good the pruning. It's all about figuring out the right direction in the first place, without having to try every random direction to do so. It fundamentally needs to be a positive process, figuring out techniques to systematically pursue better directions, not just a process of avoiding bad or useless directions. Nearly all the directions are useless; avoiding them is like sweeping sand from a beach.
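For the babble-and-prune claim specifically, here's a minimal simulation sketch (my own construction, not from anywhere): the prune is identical in both conditions and even sees the true objective, yet progress is dominated by whether the babble has any sense of direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000                          # dimensionality of "idea space"
goal = np.zeros(d)
goal[0] = 1.0                      # the one direction that actually matters

def babble_and_prune(directed: bool, steps: int = 1_000) -> float:
    """Propose a step, keep it only if it improves the objective.
    The prune is the same in both conditions; only the babble differs."""
    x = np.zeros(d)
    for _ in range(steps):
        if directed:
            # Roughly the same step size, but aimed approximately at the goal.
            proposal = 0.3 * goal + rng.normal(0.0, 0.003, size=d)
        else:
            # Random babble: isotropic step of roughly the same size.
            proposal = rng.normal(0.0, 0.01, size=d)
        if (x + proposal) @ goal > x @ goal:   # prune anything that doesn't help
            x = x + proposal
    return float(x @ goal)

print(babble_and_prune(directed=False))  # ~4: perfect pruning alone crawls
print(babble_and_prune(directed=True))   # ~300: decent babble does the work
```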

Things To Take Away From The Essay

First and foremost: Yudkowsky makes absolutely no mention whatsoever of the VNM utility theorem. This is neither an oversight nor a simplification. The VNM utility theorem is not the primary coherence theorem. It's debatable whether it should be considered a coherence theorem at all.

Far and away the most common mistake when arguing about coherence (at least among a technically-educated audience) is for people who've only heard of VNM to think they know what the debate is about. Looking at the top-voted comments on this essay:

  • the first links to a post which argues against VNM on the basis that it assumes probabilities and preferences are already in the model
  • the second argues that two of the VNM axioms are unrealistic

I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.

So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage's theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent's implied probabilities are instead derived. Yudkowsky's essay does a good job communicating these concepts, but doesn't emphasize that this is different from VNM.

One more common misconception which this essay quietly addresses: the idea that every system can be interpreted as an expected utility maximizer. This is technically true, in the sense that we can always pick a utility function which is maximized under whatever outcome actually occurred. And yet... Yudkowsky gives multiple examples in which the system is not a utility maximizer. What's going on here?

The coherence theorems implicitly put some stronger constraints on how we're allowed to "interpret" systems as utility maximizers. They assume the existence of some resources, and talk about systems which are pareto-optimal with respect to those resources - e.g. systems which "don't throw away money". Implicitly, we're assuming that the system generally "wants" more resources, and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily.

(Side note: as in Yudkowsky's hospital-administrator example, we need not assume that the agent "wants" more resources as a terminal goal; the agent may only want more resources in order to exchange them for something else. The theorems still basically work, so long as resources can be spent for something the agent "wants".)
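To make the "throws away resources" criterion concrete, here's a minimal money-pump sketch (my own toy example, not from the essay): an agent whose pairwise trades are cyclic ends up right back where it started, strictly poorer - exactly the kind of pareto-dominated trajectory the theorems rule out.

```python
# Trades the agent will make whenever offered, each for a $1 fee:
# it reveals A -> B, B -> C, C -> A "preferences".
trades = [("A", "B"), ("B", "C"), ("C", "A")]
fee = 1

def money_thrown_away(trades, start="A", max_hops=10):
    """Follow the agent's revealed trades; return the money it spent by the
    time it lands in a state it already occupied (pure waste of a resource)."""
    state, spent, visited = start, 0, {start}
    for _ in range(max_hops):
        next_states = [dst for src, dst in trades if src == state]
        if not next_states:
            return 0                 # no trade accepted from here: no pump
        state, spent = next_states[0], spent + fee
        if state in visited:         # back somewhere it's already been,
            return spent             # having paid to get there: dominated
        visited.add(state)
    return 0

print(money_thrown_away(trades))     # 3: the agent paid $3 to end up back at "A"
```

Acyclicity here only gets you consistent ordinal preferences; the full expected-utility versions of the theorems add uncertainty and probabilities on top of this deterministic core.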

Of course, we can very often find things which work like "resources" for purposes of the theorems even when they're not baked into the problem. For instance, in thermodynamics, energy and momentum work like resources, and we could use the coherence theorems to talk about systems which don't throw away energy and/or momentum in a pareto-suboptimal manner. Biological cells are a good example: presumably they make efficient use of energy, as well as other metabolic resources, so we should expect the coherence theorems to apply.

Some Problems With (Known) Coherence Theorems

Financial markets are the ur-example of inexploitability and pareto efficiency (in the same sense as the coherence theorems). They generally do not throw away resources in a pareto-suboptimal manner, and this can be proven for idealized mathematical markets. And yet, it turns out that even an idealized market is not equivalent to an expected utility maximizer, in general. (Economists call this "nonexistence of a representative agent".) That's a pretty big red flag.

The problem, in this case, is that the coherence theorems implicitly assume that the system has no internal state (or at least no relevant internal state). Once we allow internal state, subagents matter - see the essay "Why Subagents?" for more on that.
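A minimal sketch of the "no representative agent" phenomenon (my own toy numbers; see "Why Subagents?" for the real argument): a committee of two expected-money maximizers with different beliefs, which only trades when both members expect to profit, cannot be money-pumped, yet its behavior matches no single probability-assigning agent.

```python
def committee(price, beliefs=(0.3, 0.7)):
    """Two expected-money maximizers with different probabilities for event R,
    jointly deciding whether to trade a ticket that pays $1 if R happens.
    A trade only happens if every member expects it to be profitable."""
    if all(p > price for p in beliefs):
        return "buy"
    if all(p < price for p in beliefs):
        return "sell"
    return "pass"

for price in (0.2, 0.5, 0.8):
    print(price, committee(price))
# 0.2 buy, 0.5 pass, 0.8 sell.
# A single expected-money maximizer with belief p buys below p and sells
# above p; it has no whole range of prices where it refuses both sides.
# The committee's pass zone (0.3 to 0.7) is inexploitable, but it cannot
# be reproduced by any single p: no representative agent.
```

(The internal-state/path-dependence point needs a slightly richer setup than this; the sketch only shows the missing-trades part.)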

Another pretty big red flag: real systems can sometimes "change their mind" for no outwardly-apparent reason, yet still be pareto efficient. A good example here is a bookie with a side channel: when the bookie gets new information, the odds update, even though "from the outside" there's no apparent reason why the odds are changing - the outside environment doesn't have access to the side channel. The coherence theorems discussed here don't handle such side channels. Abram has talked about more general versions of this issue (including logical uncertainty connections) in his essays on Radical Probabilism.

An even more general issue, which Abram also discusses in his Radical Probabilism essays: while the coherence theorems make a decent argument for probabilistic beliefs and expected utility maximization at any one point in time, the coherence arguments for how to update are much weaker than the other arguments. Yudkowsky talks about conditional probability in terms of conditional bets - i.e. bets which only pay out when a condition triggers. That's fine, and the coherence arguments work for that use-case. The problem is, it's not clear that an agent's belief-update when new information comes in must be equivalent to these conditional bets.

Finally, there's the assumption that "resources" exist, and that we can use trade-offs with those resources in order to work out implied preferences over everything else. I think instrumental convergence provides a strong argument that this will be the case, at least for the sorts of "agents" we actually care about (i.e. agents which have significant impact on the world). However, that's not an argument which is baked into the coherence theorems themselves, and there's some highly nontrivial steps to make the argument.

Side-Note: Probability Without Utility

At this point, it's worth noting that there are foundations for probability which do not involve utility or decision theory at all, and I consider these foundations much stronger than the coherence theorems. Frequentism is the obvious example. Another prominent example is information theory and the minimum description length foundation of probability theory.

The most fundamental foundation I know of is Cox's theorem, which is more of a meta-foundation explaining why the same laws of probability drop out of so many different assumptions (e.g. frequencies, bets, minimum description length, etc).

However, these foundations do not say anything at all about agents or utilities or expected utility maximization. They only talk about probabilities.

Towards A Better Coherence Theorem

As I see it, the real justification for expected utility maximization is not any particular coherence theorem, but rather the fact that there's a wide variety of coherence theorems (and some other kinds of theorems, and empirical results) which all seem to point in a similar direction. When that sort of thing happens, it's a pretty strong clue that there's something fundamental going on. I think the "real" coherence theorem has yet to be discovered.

What features would such a theorem have?

Following the "Why Subagents?" argument, it would probably prove that a system is equivalent to a market of expected utility maximizers rather than a single expected utility maximizer. It would handle side-channels. It would derive the notion of an "update" on incoming information.

As a starting point in searching for such a theorem, probably the most important hint is that "resources" should be a derived notion rather than a fundamental one. My current best guess at a sketch: the agent should make decisions within multiple loosely-coupled contexts, with all the coupling via some low-dimensional summary information - and that summary information would be the "resources". (This is exactly the kind of setup which leads to instrumental convergence.) By making pareto-resource-efficient decisions in one context, the agent would leave itself maximum freedom in the other contexts. In some sense, the ultimate "resource" is the agent's action space. Then, resource trade-offs implicitly tell us how the agent is trading off its degree of control within each context, which we can interpret as something-like-utility.

Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?

While I don't disagree with the object-level point of this post, I generally think things of the form "We should all condemn X!" belong on social media, not on LessWrong.

"Let's all condemn X" is a purely political topic for most values of X. This post in particular is worded in a way which gives a very strong vibe of encouraging groupthink, and of encouraging soldier-mindset (i.e. the counterpart to scout mindset), and of encouraging people to play simulacrum level 3+ games rather than focus on physical reality. In short, it is exactly the sort of thing which I do not want on LessWrong, even when I agree with the goals it's ultimately trying to achieve.

Strong downvoted.

johnswentworth

I think you have basically not understood the argument which I understand various MIRI folks to make, and I think Eliezer's comment on this post does not explain the pieces which you specifically are missing. I'm going to attempt to clarify the parts which I think are most likely to be missing. This involves a lot of guessing, on my part, at what is/isn't already in your head, so I apologize in advance if I guess wrong.

(Side note: I am going to use my own language in places where I think it makes things clearer, in ways which I don't think e.g. Eliezer or Nate or Rob would use directly, though I think they're generally gesturing at the same things.)

A Toy Model/Ontology

I think a core part of the confusion here involves conflation of several importantly-different things, so I'll start by setting up a toy model in which we can explicitly point to those different things and talk about how their differences matter. Note that this is a toy model; it's not necessarily intended to be very realistic.

Our toy model is an ML system, designed to run on a hypercomputer. It works by running full low-level physics simulations of the universe, for exponentially many initial conditions. When the system receives training data/sensor-readings/inputs, it matches the predicted-sensor-readings from its low-level simulations to the received data, does a Bayesian update, and then uses that to predict the next data/sensor-readings/inputs; the predicted next-readings are output to the user. In other words, it's doing basically-perfect Bayesian prediction on data based on low-level physics priors.
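If it helps to see the shape of the computation, here's a schematic rendering of the toy model's loop, shrunk down to something runnable (the "physics" and likelihood are arbitrary stand-ins of my own; nothing about them is meant to be realistic):

```python
import numpy as np

rng = np.random.default_rng(0)

n_worlds = 10_000                                    # "exponentially many", in spirit
initial_conditions = rng.normal(size=(n_worlds, 8))  # candidate low-level world states

def simulate_sensors(worlds, t):
    """Stand-in for 'run full low-level physics from each initial condition
    and read off the predicted sensor value at time t'."""
    return np.sin(worlds @ np.arange(8) + t)

def predict_next(observations, noise=0.1):
    """Bayesian update over initial conditions against the data received so
    far, then posterior-weighted prediction of the next sensor reading."""
    log_w = np.zeros(n_worlds)
    for t, obs in enumerate(observations):
        log_w += -0.5 * ((simulate_sensors(initial_conditions, t) - obs) / noise) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w @ simulate_sensors(initial_conditions, len(observations))

print(predict_next([0.1, -0.3, 0.7]))   # predicted next sensor reading
```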

Claim 1: this toy model can "extract preferences from human data" in behaviorally the same way that GPT does (though presumably the toy model would perform better). That is, you can input a bunch of text data, then prompt the thing with some moral/ethical situation, and it will continue the text in basically the same way a human would (at least within distribution). (If you think GPTs "understand human values" in a stronger sense than that, and that difference is load-bearing for the argument you want to make, then you should leave a response highlighting that particular divergence.)

Modulo some subtleties which I don't expect to be load-bearing for the current discussion, I expect MIRI-folk would say:

  1. Building this particular toy model, and querying it in this way, addresses ~zero of the hard parts of alignment.
  2. Basically-all of the externally-visible behavior we've seen from GPT to date looks like a more-realistic operationalization of something qualitatively similar to the toy model. GPT answering moral questions similarly to humans tells us basically-nothing about the difficulty of alignment, for basically the same reasons that the toy model answering moral questions similarly to humans would tell us basically-nothing about the difficulty of alignment.

(Those two points are here as a checksum, to see whether your own models have diverged yet from the story told here.)

(Some tangential notes:

  • The user interface of the toy model matters a lot here. If we just had an amazing simulator, we could maybe do a simulated long reflection, but both the toy model and GPT are importantly not that.
  • The "match predicted-sensor-readings from low-level simulation to received data" step is hiding a whole lot of subtlety, in ways which aren't relevant yet but might be later.

)

So, what are the hard parts and why doesn't the toy model address them?

"Values", and Pointing At Them

First distinction: humans' answers to questions about morality are not the same as human values. More generally, any natural-language description of human values, or natural-language discussion of human values, is not the same as human values.

(On my-model-of-a-MIRIish-view:) If we optimize hard for humans' natural-language yay/nay in response to natural language prompts, we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.

The central thing-which-is-hard-to-do is to point powerful intelligence at human values (as opposed to "humans' natural-language yay/nays in response to natural language prompts", which are not human values and are not a safe proxy for human values, but are probably somewhat easier to point an intelligence at).

Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) human values (which basically matches humans' concept of human values, by assumption). Conceptually, the key question is something like "is the concept of human values within this mind the type of thing which a pointer in the mind can point at?". But our toy model has nothing like that. Even with full access to the internals of the toy model, it's just low-level physics; identifying "human values" embedded in the toy model is no easier than identifying "human values" embedded in the physics of our own world. So that's reason #1 why the toy model doesn't address the hard parts: the toy model doesn't "understand" human values in the sense of internally using ~the same concept of human values as humans use.

In some sense, the problem of "specifying human values" and "aiming an intelligence at something" are just different facets of this same core hard problem:

  • we need to somehow get a powerful mind to "have inside it" a concept which basically matches the corresponding human concept at which we want to aim
  • "have inside it" cashes out to something roughly like "the concept needs to be the type of thing which a pointer in the mind can point to, and then the rest of the mind will then treat the pointed-to thing with the desired human-like semantics"; e.g. answering external natural-language queries doesn't even begin to cut it
  • ... and then some pointer(s) in the mind's search algorithms need to somehow be pointed at that concept.

Why Answering Natural-Language Queries About Morality Is Basically Irrelevant

A key thing to note here: all of those "hard problem" bullets are inherently about the internals of a mind. Observing external behavior in general reveals little-to-nothing about progress on those hard problems. The difference between the toy model and the more structured mind is intended to highlight the issue: the toy model doesn't even contain the types of things which would be needed for the relevant kind of "pointing at human values", yet the toy model can behaviorally achieve ~the same things as GPT.

(And we'd expect something heavily optimized to predict human text to be pretty good at predicting human text regardless, which is why we get approximately-zero evidence from the observation that GPT accurately predicts human answers to natural-language queries about morality.)

Now, there is some relevant evidence from interpretability work. Insofar as human-like concepts tend to have GPT-internal representations which are "simple" in some way, and especially in a way which might make them easily-pointed-to internally in a way which carries semantics across the pointer, that is relevant. On my-model-of-a-MIRIish-view, it's still not very relevant, since we expect major phase shifts as AI gains capabilities, so any observation of today's systems is very weak evidence at best. But things like e.g. Turner's work retargeting a maze-solver by fiddling with its internals are at least the right type-of-thing to be relevant.

Side Note On Relevant Capability Levels

I would guess that many people (possibly including you?) reading all that will say roughly:

Ok, but this whole "If we optimize hard for humans' natural-language yay/nay in response to natural language prompts, we die" thing is presumably about very powerful intelligences, not about medium-term, human-ish level intelligences! So observing GPT should still update us about whether medium-term systems can be trusted to e.g. do alignment research.

Remember that, on a MIRIish model, meaningful alignment research is proving rather hard for human-level intelligence; one would therefore need at least human-level intelligence in order to solve it in a timely fashion. (Also, AI hitting human-level at tasks like AI research means takeoff is imminent, roughly speaking.) So the general pathway of "align weak systems -> use those systems to accelerate alignment research" just isn't particularly relevant on a MIRIish view. Alignment of weaker systems is relevant only insofar as it informs alignment of more powerful systems, which is what everything above was addressing.

I expect plenty of people to disagree with that point, but insofar as you expect people with MIRIish views to think weak systems won't accelerate alignment research, you should not expect them to update on the difficulty of alignment due to evidence whose relevance routes through that pathway.

Writing this post as if it's about AI risk specifically seems weirdly narrow.

It seems to be a pattern across most of society that young people are generally optimistic about the degree to which large institutions/society can be steered, and older people who've tried to do that steering are mostly much less optimistic about it. Kids come out of high school/college with grand dreams of a great social movement which will spur sweeping legislative change on X (climate change, animal rights, poverty, whatever). Unless they happen to pick whichever X is actually the next hot thing (gay rights/feminism/anti-racism in the past 15 years), those dreams eventually get scaled back to something much smaller, and also get largely replaced by cynicism about being able to do anything at all.

Same on a smaller scale: people go into college/grad school with dreams of revolutionizing X. A few years later, they're working on problems which will never realistically matter much, in order to reliably pump out papers which nobody will ever read. Or, new grads go into a new job at a big company, and immediately start proposing sweeping changes and giant projects to address whatever major problems the company has. A few years later, they've given up on that sort of thing and either just focus on their narrow job all the time or leave to found a startup.

Given how broad the pattern is, it seems rather ridiculous to pose this as a "trauma" of the older generation. It seems much more like the older generation just has more experience, and has updated toward straightforwardly more correct views of how the world works.

Experiences like this can easily lead to an attitude like “Screw those mainstream institutions, they don’t know anything and I can’t trust them.”

Also... seriously, you think that just came from being ignored about AI? How about that whole covid thing?? It's not like we're extrapolating from just one datapoint here.

If someone older tells you "There is nothing you can do to address AI risk, just give up", maybe don't give up. Try to understand their experiences, and ask yourself seriously if those experiences could turn out differently for you.

My actual advice here would be: first, nobody ever actually advises just giving up. I think the thing which is constantly misinterpreted as "there is nothing you can do" is usually pointing out that somebody's first idea or second idea for how to approach alignment runs into some fundamental barrier. And then the newbie generates a few possible patches which will not actually get past this barrier, and very useful advice at that point is to Stop Generating Solutions and just understand the problem itself better. This does involve the mental move of "giving up" - i.e. accepting that you are not going to figure out a viable solution immediately - but that's very different from "giving up" in the strategic sense.

(More generally, the field as a whole really needs to hold off on proposing solutions more, and focus on understanding the problem itself better.)

johnswentworth

Opinion: disagreements about OpenAI's strategy are substantially empirical.

I think that some of the main reasons why people in the alignment community might disagree with OpenAI's strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.

See, this is exactly the problem. Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late. That is the fundamental reason why alignment is harder than other scientific fields. Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, "getting what you measure" in slow takeoff, however you frame it the issue is the same: things look fine early on, and go wrong later.

And as far as I can tell, OpenAI as an org just totally ignores that whole class of issues/arguments, and charges ahead assuming that if they don't see a problem then there isn't a problem (and meanwhile does things which actively select for hiding problems, like e.g. RLHF).
