All Comments

How about an argument in the shape of: 

  1. we'll get good evidence of human-like alignment-relevant concepts/values well-represented internally (e.g. Scaling laws for language encoding models in fMRI, A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations); in addition to all the accumulating behavioral evidence
  2. we'll have good reasons to believe alternate (deceptive) strategies are unlikely / relevant concepts for deceptive alignment are less accessible: e.g. thorough evals vs. situational awareness, through conceptual arguments around speed priors and not enough expressivity without CoT + avoiding steganography + robust oversight over intermediate text, by unlearning/erasing/making less accessible (e.g. by probing) concepts relevant for deceptive alignment, etc.
  3. we have some evidence for priors in favor of fine-tuning favoring strategies which make use of more accessible concepts, e.g. Predicting Inductive Biases of Pre-Trained Models, Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features.

⅓ of final projects involved evals/demos and ⅕ involved mechanistic interpretability, representing a large proportion of the cohort’s research interests.

this doesn't seem great in terms of pursuing a broad portfolio of approaches / seems to (partially) confirm worries about Goodhart-ing/overfocusing on projects with clearer feedback loops and legibility, to the detriment of more speculative and more neglected agendas

You can just have a model with the capabilities of the smartest human hacker that exfiltrates itself, hacks the 1-10% of world computing power with the shittiest protection, distributes itself Rosetta@home-style, and bootstraps whatever takeover plan using sheer brute force. That said, I see no reason for capabilities to land exactly on the point "smartest human hacker", because there is nothing special about this point, and it can be 2x, 5x, 10x, without any necessity to become 1000000x within a second.

I'm pretty optimistic about our white box alignment methods generalizing fine.

And I still don't get why! I would like to see your theory of generalization in DL that allows such a level of optimism, and "gradient descent is powerful" simply doesn't capture it.

Much better.

So, this could be an abstract at the beginning of the sequence, and the individual articles could approximately provide evidence for sentences in this abstract.

Or you could do it Eliezer's way, and start with posting the articles that provide evidence for the individual sentences (each article containing its own summary), and only afterwards post an article that ties it all together. This way would allow readers to evaluate each article on its own merits, without being distracted by whether they agree or disagree with the conclusion.

It is possible that you have actually tried to do exactly this, but speaking for myself, I never would have guessed so from reading the original articles.

(Also, if your first article gets downvoted, please pause and reflect on that fact. Either your idea is wrong and readers express disagreement, or it is just really badly written and readers express confusion. In either case, pushing forward is not helpful.)

If someone accidentally uses “he” when they meant “she” or vice versa when talking about a person whose gender they know, it is likely because the speaker’s first language does not distinguish between he and she. This could be Finnish, Estonian, Hungarian, some Turkic languages, and probably also other languages. I haven’t actually looked into it, but I noticed it with a Finnish speaker.

What you see as a broken system, I see as a system working exactly as intended.

Should we keep any nonsense on LW front page just because the author asked us nicely?

Well, yes. I guess it's more of an... expression of frustration. Like telling the space-lizard-Jesus guy: "Dude, have you ever read the Bible?" You don't expect he did, and yes that is the reason why he says what he says... but you also do not really expect him to read it now.

(Then he asks you for help at publishing his own space Bible.)

Testing it on out of distribution examples seems helpful. If an AI still acts as if it follows human values out of distribution, it probably truly cares about human values. For AI with situational awareness, we can probably run simulations to an extent (and probably need to bootstrap this after a certain capabilities threshold).

In software development / IT contexts, "security by obscurity" (that is, having the security of your platform rely on the architecture of that platform remaining secret) is considered a terrible idea. This is a result of a lot of people trying that approach, and it ending badly when they do.

But the thing that is a bad idea is quite specific - it is "having a system which relies on its implementation details remaining secret". It is not an injunction against defense in depth, and having the exact heuristics you use for fraud or data exfiltration detection remain secret is generally considered good practice.

There is probably more to be said about why the one is considered terrible practice and the other is considered good practice.

I have more examples, but unfortunately some of them I can't talk about.  A few random things that come to mind:

  • OpenPhil routinely requests that grantees not disclose that they've received an OpenPhil grant until OpenPhil publishes it themselves, which usually happens many months after the grant is disbursed.
  • Nearly every instance that I know of where EA leadership refused to comment on anything publicly post-FTX due to advice from legal counsel.
  • So many things about the Nonlinear situation.
  • Coordination Forum requiring attendees to agree to confidentiality re: the attendance and the content of any conversations with people who wanted to attend but not have their attendance known to the wider world, like SBF, and also people in the AI policy space.

I have now also taken the 2023 organizer census.

If you don't have more examples, I think 

  1. it is too early to draw conclusions from OpenAI
  2. one special case doesn't invalidate the concept

Not saying your point is wrong, just that this is not convincing me.

I also find it plausible that the top 1-5 scholars are responsible for most of the impact, and we want to investigate this to a greater extent. Unfortunately, it's difficult to evaluate the impact of a scholar's research and career trajectory until more like 3-12 months after the program, so we decided to separate that analysis from the retrospective of the summer 2023 program.

We've begun collecting this type of information (for past cohorts) via alumni surveys and other sources and hope to have another report out in the next few months that more closely tracks the impact that we expect MATS to have.

Yes, thanks!

I am familiar with some work from MIRI about that which focuses on the Löbian obstacle, e.g. this 2013 paper: Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.

But I should look closer at other parts of those MIRI papers; perhaps there might be some material which actually establishes some invariants, at least for some simple, idealized examples of self-modification...

I think that short, critical comments can sometimes read as snarky/rude, and I don't want to speak that way to Nora. I also wanted to take some space to try to invoke the general approach to thinking about tribalism and show how I was applying it here, to separate my point from one that is only arguing against this particular tribal line that Nora is reifying, but instead to encourage restraint in general. Probably you're right that I could make it substantially shorter; writing concisely is a skill I want to work on.

I don't know who the "ai nihilists" are supposed to be. My sense is that you could've figured out from my comment objecting to playing fast and loose with group names that I wouldn't think that phrase carved reality and that I wasn't sure whom you had in mind!

You apparently completely misunderstood the point we were making with the white box thing.

 

I think you need to taboo the term white box and come up with a new term that will result in less confusion/fewer people talking past each other.

Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high level dynamics

One major intuition pump I think is important: evolution doesn't get to evaluate everything locally. Gradient descent does. As a result, evolution is slow to eliminate useless junk, though it does do so eventually. Gradient descent is so eager to do it that we call it catastrophic forgetting.

Gradient descent wants to use everything in the system for whatever it's doing, right now.

I disagree with the optimists that this makes it trivial, because to me it appears that the dynamics that make short-term misalignment likely are primarily organizational among humans - the incentives of competition between organizations and individual humans. Also, RL-first AIs will inline those dynamics much faster than RLHF can get them out.

I am appreciative of folks like yourself, Nora, and Quintin building detailed models of the alignment problem and presenting thoughtful counterarguments to existing arguments about the difficulty. I think anyone would consider it a worthwhile endeavor regardless of their perspective on how hard the problem is, and I wish you good luck in your efforts to do so.

In my culture, people understand and respect that humans can easily trick themselves into making terrible collective decisions by tribal dynamics. They respond to this in many ways, such as by working to avoid making it a primary part of people's work or of people's attention, and also by making sure to not accidentally trigger tribal dynamics by inventing tribal distinctions that didn't formerly exist but get picked up by the brain and thunk into being part of our shared mapmaking. It is generally considered healthy to spend most of our attention on understanding the world, solving problems, and sharing arguments, rather than making political evaluations about which group one is a member of. People are also extra hesitant about creating groups that exist fundamentally in opposition to other groups.

My current belief is that the vast majority of the people who have thought about the impacts and alignment of advanced AI (academics like Geoffrey Hinton, forecasters like Phil Tetlock, rationalists like Scott Garrabrant, and so forth) don't think of themselves as participating in 'optimist' or 'pessimist' communities, and would not use the term to describe their community. So my sense is that this is a false description of the world. I have a spidey-sense that language like this often tries to make itself become true by saying it is true, and is good at getting itself into people's monkey brains and inventing tribal lines between friends where formerly there were none.

I think that the existing so-called communities (e.g. "Effective Altruism" or "Rationality" or "Academia") are each in their own ways bereft of some essential qualities for functioning and ethical people and projects. This does not mean that if you or I create new ones quickly they will be good or even better. I do ask that you take care to not recklessly invent new tribes that have even worse characteristics than those that already exist.

From my culture to yours, I would like to make a request that you exercise restraint on the dimension of reifying tribal distinctions that did not formerly exist. It is possible that there are two natural tribes here that will exist in healthy opposition to one another, but personally I doubt it, and I hope you will take time to genuinely consider the costs of greater tribalism.

It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods).

I have the opposite expectation there; I think it's just that current methods are pretty primitive.

I appreciate you making an unambiguously bad post so that your unfinished thoughts can be critiqued. I look forward to the good post made from updating on comments.

Edit: this is not sarcasm and I mean it to be an actual statement of approval. Please actually do this more. This comes from a philosophy that one should post often enough to be heavily downvoted sometimes, to <?avoid confirmation bias?>

Overall take: unimpressed.

Very simple gears in a subculture's worldview can keep being systematically misperceived if it's not considered worthy of curious attention. On the local llama subreddit, I keep seeing assumptions that AI safety people call for never developing AGI, or claim that the current models can contribute to destroying the world. Almost never is there anyone who would bother to contradict such claims or assumptions. This doesn't happen because it's difficult to figure out, this happens because the AI safety subculture is seen as unworthy of engagement, and so people don't learn what it's actually saying, and don't correct each other on errors about what it's actually saying.

This gets far worse with more subtle details, where the standard of willingness to engage is raised higher: actually studying what the others are saying, which would be difficult to figure out even with curious attention. Rewarding engagement is important.

Thanks for writing this! I’m curious if you have any information about the following questions:

  1. What does the MATS team think are the most valuable research outputs from the program?

  2. Which scholars was the MATS team most excited about in terms of their future plans/work?

IMO, these are the two main ways I would expect MATS to have impact: research output during the program and future research output/career trajectories of scholars.

Furthermore, I’d suspect things to be fairly tails-based (where EG the top 1-3 research outputs and the top 1-5 scholars are responsible for most of the impact).

Perhaps MATS as a program feels weird about ranking output or scholars so explicitly, or feels like it’s not their place.

But I think this kind of information seems extremely valuable. If I were considering whether or not I wanted to donate, for instance, my main questions would be “is the research good?” and “is the career development producing impactful people?” (as opposed to things like “what is the average rating on the EOY survey?”, though of course that information may matter for other purposes).

Can you give an example of a theoretical argument of the sort you'd find convincing? Can be about any X caring about any Y.

Nicely written. But... no? Obviously no?

Direct Instruction is a thing that has studies on it, for one.

How about reading a fun book and then remembering the plot?

Spaced repetition on flashcards of utter pointless trivia seems to work quite well for its intended purpose.

Learning how to operate a machine just from reading the manual is a key skill for both soldiers and grad students.

Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and a non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).
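
For concreteness, a minimal NumPy sketch of that metric as described above, with one dictionary's rows treated as the ground truth; the exact implementation used may differ in details:

```python
import numpy as np

def mean_max_cosine_similarity(ground_truth: np.ndarray, learned: np.ndarray) -> float:
    """Rows are dictionary features. For each ground-truth feature, take its
    best cosine match among the learned features, then average those maxima."""
    gt = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    ld = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    sims = gt @ ld.T  # pairwise cosine similarities
    return float(sims.max(axis=1).mean())
```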

I would be up for having a dialogue with Nate. Quintin, myself, and the others in the Optimist community are working on posts which will more directly critique the arguments for pessimism.

Yeah, not a ton. For I think the obvious reason that real-world agents are complicated and hard to reason about.

Though search up "tiling agents" for some MIRI work in this vein.

I, too, have completed the survey :)

Thanks, it was fun to fill out!

As a recent example, from this article on the recent OpenAI kerfuffle:

Two people familiar with the board’s thinking say that the members felt bound to silence by confidentiality constraints.

You are welcome! Thank you for taking it :)

Strong upvote for mentioning that a dialogue between both sides would be a huge positive for people's careers. I can actually see the discussion being as big as influencing the scope of how we should think "about what is and is not easy in alignment". Hope Nate and @Nora Belrose are up for that. The discussion would be a good thing to document, and would help deconfuse the divide between both perspectives.

(Edit: But to be fair to Nate, he does explain in his posts why the alignment problem is hard to solve. So maybe more elaboration from the other camp is what's required.)

I want to see a dialogue happen between someone with Nate's beliefs and someone with Nora's beliefs. The career decisions of hundreds of people, including myself, depend on clearly thinking through the arguments behind various threat models. I find it pretty embarrassing for the field that there is mutual contempt between people who disagree the most, when such a severe disagreement means the greatest opportunity to understand basic dynamics behind AGI.

Sure, communication is hard sometimes so maybe the dialogue is infeasible, and in fact I can't think of any particular people I'd want to do this. It still makes me sad.

Dunno. My random guess would be Meaningness, but it's probably not.

I almost stopped reading after Alice's first sentence because https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target

The rest was better, though I think that the more typical framing of this argument is better - what this is really about is models in RL. The thought experiment can be made closer to real-life AI by talking about model-based RL, and more tenuous arguments can be made about whether learning a model is convergent even for nominally model-free RL.

Cool idea. I think in most cases on the list, you'll have some combination of information asymmetry and an illiquid market that make this not that useful.

Take used car sales. I put my car up for sale and three prospective buyers come to check it out. We are now basically the only four people in the world with inside knowledge about the car. If the car is an especially good deal, each buyer wants to get that information for themselves but not broadcast it either to me or to the other buyers. I dunno man, it all seems like a stretch to say that the four of us are going to find a prediction market worth it.

Yup, this all seems basically right. Though in reality I'm not that worried about the "we might outlaw some good actions" half of the dilemma. In real-world settings, actions are so multi-faceted that being able to outlaw a class of actions based on any simple property would be a research triumph.

Also see https://www.lesswrong.com/posts/LR8yhJCBffky8X3Az/using-predictors-in-corrigible-systems or https://www.lesswrong.com/posts/qpZTWb2wvgSt5WQ4H/defining-myopia for successor lines of reasoning.

If AGI is possible, then its superintelligence is a given

It needs to happen quickly or surreptitiously to be a problem.

I don’t think misalignment is highly conjunctive

Incorrigible misalignment is at least one extra assumption.

why is “superintelligence + misalignment” highly conjunctive?

In the sense that matters, it needs to be fast, surreptitious, incorrigible, etc.

If AGI is AGI, there won’t be any problems to notice

Huh?

The "AI is easy to control" piece does talk about scaling to superhuman AI:

In what follows, we will argue that AI, even superhuman AI, will remain much more controllable than humans for the foreseeable future. Since each generation of controllable AIs can help control the next generation, it looks like this process can continue indefinitely, even to very high levels of capability.

If we assume that each generation can ensure a relatively strong notion of alignment between it and the next generation, then I think this argument goes through.

However, there are weaker notions of control which are insufficient for this sort of bootstrapping argument. Suppose each generation can ensure the following weaker notion of control: "we can set up a training, evaluation, and deployment protocol with sufficient safeguards (monitoring, auditing, etc.) such that we can avoid generation N+1 AIs being capable of causing catastrophic outcomes (like AI takeover) while using those AIs to speed up the labor of generation N by a large multiple". This notion of control doesn't (clearly) allow the bootstrapping argument to go through. In particular, suppose that all AIs smarter than humans are deceptively aligned and they defect on humanity at the point where they are doing tasks which would be extremely hard for a human to oversee. (This isn't the only issue, but it is a sufficient counterexample.)

This weaker notion of control can be very useful in ensuring good outcomes via getting lots of useful work out of AIs, but we will likely need to build something more scalable eventually.

(See also my discussion of using human level ish AIs to automate safety research in the sibling.)

This comment seems to be assuming some kind of hard takeoff scenario, which I discount as absurdly unlikely. That said, even in that scenario, I'm pretty optimistic about our white box alignment methods generalizing fine.

(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single token features on the part of the randomized transformer and when you ignore these features (or correct in some other way I'm forgetting?), the SAE on an actual transformer indeed has higher correlation.)

In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).

Eh, random people complain. Screenshots of text seem fine, especially in shortform. It honestly seems fine anywhere. I also really don't think that accessibility should matter much here: the number of people reading on a screenreader or using assistive technologies is quite small, if they browse LessWrong they will already be running into a bunch of problems, and there are pretty good OCR technologies around these days that can be integrated into those.

A thought experiment: the mildly xenophobic large alien civilization.

Imagine at some future time we encounter an expanding grabby aliens civilization. The civilization is much older and larger than ours, but cooperates poorly. Their individual members tend to have a mild distaste for the existence of aliens (such as us). It isn't that severe, but there are very many of them, so their total suffering at our existence and wish for us to die outweighs our own suffering if our AI killed us, and our own will to live.

They aren't going to kill us directly, because they co-operate poorly, individually don't care all that much, and defense has the advantage over offense.

But, in this case, the AI programmed as you proposed will kill us once it finds out about these mildly xenophobic aliens. How do you feel about that? And do you feel that, if I don't want to be killed in this scenario, my opposition is unjustified?

I have taken the survey. Or at least the parts I can remember with my aging brain.

I'd expect the amount of time this all takes to be a function of the time-control.

Like, if I have 90 mins, I can allocate more time to all of this. I can consult each of my advisors at every move. I can ask them follow-up questions.

If I only have 20 mins, I need to be more selective. Maybe I only listen to my advisors during critical moves, and I evaluate their arguments more quickly. Also, this inevitably affects the kinds of arguments that the advisors give.

Both of these scenarios seem pretty interesting and AI-relevant. My all-things-considered guess would be that the 20 mins version yields high enough quality data (particularly for the parts of the game that are most critical/interesting & where the debate is most lively) that it's worth it to try with shorter time controls.

(Epistemic status: Thought about this for 5 mins; just vibing; very plausibly underestimating how time pressure could make the debates meaningless).

It is not clear to me exactly what "belief regarding suffering" you are talking about, what you mean by "ordinary human values"/"your own personal unique values". 

Belief regarding suffering: the belief that s-risks are bad, independently of human values as would be represented in CEV.

Ordinary human values: what most people have.

Your own personal unique values: what you have, but others don't.

Please read the paper, and if you have any specific points of disagreement cite the passages you would like to discuss. Thank you

In my other reply comment, I pointed out disagreements with particular parts of the paper you cited in favour of your views. My fundamental disagreement though, is that you are fundamentally relying on an unjustified assumption, repeated in your comment above:

even if s-risks are very morally undesirable (either in a realist or non-realist sense)

The assumption being that s-risks are "very morally undesirable", independently of human desires (represented in CEV). 

Thanks for the reply.

We don't work together with animals - we act towards them, generously or not.

That's key because, unlike for other humans, we don't have an instrumental reason to include them in the programmed value calculation, and to precommit to doing so, etc. For animals, it's more of a terminal goal. But if that terminal goal is a human value, it's represented in CEV. So where does this terminal goal over and above human values come from?

Regarding 2:

There is (at least) a non-negligible probability that an adequate implementation of the standard CEV proposal results in the ASI causing or allowing the occurrence of risks of astronomical suffering (s-risks).

You don't justify why this is a bad thing over and above human values as represented in CEV.

Regarding 2.1:

The normal CEV proposal, like CEO-CEV and men-CEV, excludes a subset of moral patients from the extrapolation base.

You just assume it, that the concept of "moral patients" exists and includes non-humans. Note, to validly claim that CEV is insufficient, it's not enough to say that human values include caring for animals - it has to be something independent of or at least beyond human values. But what? 

Regarding 4.2:

However, as seen above, it is not the case that there are no reasons to include sentient non-humans since they too can be positively or negatively affected in morally relevant ways by being included in the extrapolation base or not.

Again, existence and application of the "moral relevance" concept over and above human values just assumed, not justified.

Regarding 3.2:

At any given point in time t, the ASI should take those actions that would in expectation most fulfil the coherent extrapolated volition of all sentient beings that exist in t.

Good, by focusing at the particular time at least you aren't guaranteeing that the AI will replace us with utility monsters. But if utility monsters do come to exist or be found (e.g. utility monster aliens) for whatever reason, the AI will still side with them, because:

Contrary to what seems to be the case in the standard CEV proposal, the interests of future not-yet-existing sentient beings, once they exist, would not be taken into account merely to the extent to which the extrapolated volitions of currently existing individuals desire to do so.

Also, I have to remark on:

Finally, it should also be noted that this proposal of SCEV (as CEV) is not intended as a realist theory of morality, it is not a description of the metaphysical nature of what constitutes the ‘good’. I am not proposing a metaethical theory but merely what would be the most morally desirable ambitious value learning proposal for an ASI.

You assert your approach is "the most morally desirable" while disclaiming moral realism. So where does that "most morally desirable" come from?

And in response to your comment:

Yes, but (as I argue in 2.1 and 2.2) there are strong reasons to include all sentient beings. And (to my knowledge) there are no good reasons to support any religion.

The "reasons" are simply unjustified assumptions, like "moral relevance" existing (independent of our values, game theoretic considerations including pre-commitments, etc.) (and yes, you don't explicitly say it exists independent of those things in so many words, but your argument doesn't hold unless they do exist independently).

I find your text confusing. Let’s go step by step.

  • AlphaZero-chess has a very simple reward function: +1 for getting checkmate, -1 for opponent checkmate, 0 for draw
  • A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
  • If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.

By analogy:

  • The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
  • A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
  • If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.

I agree with this statement, because the sign change directly inverts the reward, and thus the previous reward target is now a bad thing to aim for. But my view is that this is probably unrepresentative, and that brains/brain-like AGI are much more robust than you think to changes in their value/reward functions (though not infinitely robust), due to the very simple reward function you pointed out.

So I basically disagree with this example representing a major problem with NN/Brain-Like AGI robustness.

To respond to this:

So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)

This doesn't actually matter for my purposes, as I only need the existence of simple reward functions, as you described, to conclude that deceptive alignment is unlikely to happen, and I am leaving it up to the people who are aligning AI, like Nora Belrose, to actually implement this ideal.

Essentially, I'm focusing on the implications of the existence of simple algorithms for values, and pointing out that various alignment challenges either go away or are far easier to do if we grant that there is a simple reward function for values, which is very much a contested/disagreed position on LW.

So I think we basically agree that there is a simple reward function for values, but I think this implies some other big changes in alignment which reduces the risk of AI catastrophe drastically, mostly via getting rid of deceptive alignment as an outcome that will happen, but there are various other side benefits I haven't enumerated because it would make this comment too long.

It takes a lot of time for advisors to give advice, the player has to evaluate all the suggestions, and there's often some back-and-forth discussion. It takes much too long to make moves in under a minute.

For Google Forms, if the question is not required, you can click on the same radio button twice to cancel the selection.

It is not clear to me exactly what "belief regarding suffering" you are talking about, what you mean by "ordinary human values"/"your own personal unique values". 

As I argue in Section 2.2., there is (at least) a non-negligible chance that s-risks occur as a result of implementing human-CEV, even if s-risks are very morally undesirable (either in a realist or non-realist sense).

Please read the paper, and if you have any specific points of disagreement cite the passages you would like to discuss. Thank you

Well, it's more "you steal 8 billion dollars and gamble them on all-or-nothing bets where if you win you are planning to spend them on EA". I think that totally counts as EA.

Like, Sam spent that money in the hopes of growing FTX, and he was building FTX for earning-to-give reasons.

A simpler way to phrase my question is "If you steal 8 billion and spend 7.9 billion on non-EA things, did you really do it for EA?"

Hi simon, 

it is not clear to me which of the points of the paper you object to exactly, and I feel some of your worries may already be addressed in the paper. 

For instance, you write: "And that's relevant because  they are actually existing entities we are working together with on this one planet." First, some sentient non-humans already exist, that is, non-human animals. Second, the fact that we can work or not work with given entities does not seem to be what is relevant in determining whether they should be included in the extrapolation base or not, as I argue in sections 2., 2.1., and 4.2.

For utility-monster-type worries and worries about the possibility that "misaligned" digital minds would take control see section 3.2.

You write: "Well then, anyone can say Y is the all-important thing about anything obviously important to them. A religious person might want an AI to follow the tenets of their religion." Yes, but (as I argue in 2.1 and 2.2) there are strong reasons to include all sentient beings. And (to my knowledge) there are no good reasons to support any religion. As I argue in the paper and has been argued elsewhere, the first values you implement will change the ASI's behaviour in expectation, and as a result, what values to implement first cannot be left to the AI to be figured out. For instance, because we have better reasons to believe that all sentient beings can be positively or negatively affected in morally relevant ways than to believe that only given members of a specific religion matter, it is likely best to include all sentient beings than to include only the members of the religion. See Section 2.

I guess another thing I'm wondering about, is how we could tell apart genes that impact a trait via their ongoing metabolic activities (maybe metabolic is not the right term... what I mean is that the gene is being expressed, creating proteins, etc, on an ongoing basis), versus genes that impact a trait via being important for early embryonic / childhood development, but which aren't very relevant in adulthood.

Yes, this is an excellent question. And I think it's likely we could (at least for the brain) thanks to some data from this study that took brain biopsies from individuals of varying stages of life and looked at the transcriptome of cells from different parts of the brain.

My basic prior is that the effect of editing is likely to be close to the same as if you edited the same gene in an embryo iff the peak protein expression occurs in adulthood. Though there aren't really any animal experiments that I know of yet which look at how the distribution of effect sizes varies by trait and organ.

Is this clear enough:

I posit that the reason that humans are able to solve any coordination problems at all is that evolution has shaped us into game players that apply something vaguely like a tit-for-tat strategy meant to enforce convergence to a nearby Schelling Point / Nash Equilibrium, and to punish defectors from this Schelling Point / Nash Equilibrium. I invoke a novel mathematical formalization of Kant's Categorical Imperative as a potential basis for coordination towards a globally computable Schelling Point. I believe that this constitutes a promising approach to the alignment problem, as the mathematical formalization is both simple to implement and reasonably simple to measure deviations from. Using this formalization would therefore allow us both to prevent and detect misalignment in powerful AI systems. As a theory of change, I believe that applying RLHF to LLM's using a strong and consistent formalization of the Categorical Imperative is a plausible and reasonably direct route to good outcomes in the prosaic case of LLM's, and I believe that LLM's with more neuromorphic components added are a strong contender for a pathway to AGI.

I think it's worth reflecting on what type of evidence would be sufficient to convince you that we're actually making progress on the "caring" bit of alignment and not merely the "understanding" bit. Because I currently don't see what type of evidence you'd accept beyond near-perfect mechanistic interpretability.

I'm not Nate, but a pretty good theoretical argument that X method of making AIs would lead to an AI that "cared" about the user would do it for me, and I can sort of conceive of such arguments that don't rely on really good mechanistic interpretability.

I find your text confusing. Let’s go step by step.

  • AlphaZero-chess has a very simple reward function: +1 for getting checkmate, -1 for opponent checkmate, 0 for draw
  • A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
  • If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
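
For concreteness, a minimal sketch in Python of just how small that terminal reward function is (the Outcome type is illustrative, not AlphaZero's actual code); the trained value network is the object with millions of bits of complexity:

```python
from enum import Enum

class Outcome(Enum):
    WIN = 1    # you checkmated the opponent
    LOSS = 2   # you got checkmated
    DRAW = 3

def reward(outcome: Outcome) -> float:
    # The entire training signal AlphaZero-chess receives: a single number
    # handed out at the end of the game, and nothing else.
    return {Outcome.WIN: 1.0, Outcome.LOSS: -1.0, Outcome.DRAW: 0.0}[outcome]
```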

By analogy:

  • The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
  • A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
  • If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.

Do you agree with all that?

If so, then there’s no getting around that getting the right innate reward function is extremely important, right?

So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you.  :)

It’s odd that you understood me as talking about misuse. Well, I guess I’m not sure how you’re using the term “misuse”. If Person X doesn’t follow best practices when training an AI, and they wind up creating an out-of-control misaligned AI that eventually causes human extinction, and if Person X didn’t want human extinction (as most people don’t), then I wouldn’t call that “misuse”. Would you? I would call it a “catastrophic accident” or something like that. I did mention in the OP that some people think human extinction is perfectly fine, and I guess if Person X is one of those people, then it would be misuse. So I suppose I brought up both accidents and misuse.

Perhaps, but I want to create a distinction between "People train AI to do good things and aren't able to control AI for a variety of reasons, and thus humans are extinct, made into slaves, etc." and "People train AI to do stuff like bio-terrorism, explicitly gaining power, etc., and thus humans are extinct, made into slaves, etc." The optimal responses look very different if we are in a world where control is easy but preventing misuse is hard, versus one where controlling AI is hard in itself. AI safety actions as currently done are optimized far more for the case where controlling AI by humans is hard or impossible; if this is not the case, then pretty drastic changes would need to be made in how AI safety organizations do their work, especially their nascent activist wing, and they would instead need to focus on different policies.

People who I think are highly prone to not following best practices to keep AI under control, even if such best practices exist, include people like Yann LeCun, Larry Page, Rich Sutton, and Jürgen Schmidhuber, who are either opposed to AI alignment on principle, or are so bad at thinking about the topic of AI x-risk that they spout complete and utter nonsense. (example). That’s not a problem solvable by Know Your Customer laws, right? These people (and many more) are not the customers, they are among the ones doing state-of-the-art AI R&D.

In general, the more people are technically capable of making an out-of-control AI agent, the more likely that one of them actually will, even if best practices exist to keep AI under control. People like to experiment with new approaches, etc., right? And I expect the number of such people to go up and up, as algorithms improve etc. See here.

https://www.lesswrong.com/posts/C5guLAx7ieQoowv3d/lecun-s-a-path-towards-autonomous-machine-intelligence-has-1

https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control#3_4_Very_few_relevant_actors__and_great_understanding_of_AGI_safety

I note that your example of them spouting nonsense only has the full force it does if we assume that controlling AI is hard, which is what we are debating right now.

Onto my point here: my fundamental claim is that there's a counterforce to what you describe - the claim that there will be more and more people able to make an out-of-control AI agent - and that is the profit motive.

Hear me out, this will actually make sense here.

Basically, the main reason the profit motive is positive for safety is that the negative externalities of AI being uncontrollable are far, far more internalized to the person who's making the AI, since they also suffer severe losses in profitability without getting any profit from the AI. This is combined with the fact that they also have a profit motive for developing safe control techniques, assuming that control isn't very hard, since the safe techniques will probably get used in government standards for releasing AI, and there are already at least some fairly severe barriers to any release of misaligned AGI, at least assuming that there's no treacherous turn/deceptive alignment over weeks to months.

Jaime Sevilla basically has a shorter tweet on why this is the case, and I also responded to Linch making something like the points above:

https://archive.is/wPxUV

https://twitter.com/Jsevillamol/status/1722675454153252940

https://archive.is/3q0RG

https://twitter.com/SharmakeFarah14/status/1726351522307444992

Iterated Amplification is a fairly specific proposal for indefinitely scalable oversight, which doesn't involve any human in the loop (if you start with a weak aligned AI). Recursive Reward Modeling is imagining (as I understand it) a human assisted by AIs to continuously do reward modeling; DeepMind's original post about it lists "Iterated Amplification" as a separate research direction. 

"Scalable Oversight", as I understand it, refers to the research problem of how to provide a training signal to improve highly capable models. It's the problem which IDA and RRM are both trying to solve. I think your summary of scalable oversight: 

(Figuring out how to ease humans supervising models. Hard to cleanly distinguish from ambitious mechanistic interpretability but here we are.)

is inconsistent with how people in the industry use it. I think it's generally meant to refer to the outer alignment problem, providing the right training objective. For example, here's Anthropic's "Measuring Progress on Scalable Oversight for LLMs" from 2022:

To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., 2016).

It references "Concrete Problems in AI Safety" from 2016, which frames the problem in a closely related way, as a kind of "semi-supervised reinforcement learning". In either case, it's clear what we're talking about is providing a good signal to optimize for, not an AI doing mechanistic interpretability on the internals of another model. I thus think it belongs more under the "Control the thing" header.

I think your characterization of "Prosaic Alignment" suffers from related issues. Paul coined the term to refer to alignment techniques for prosaic AI, not techniques which are themselves prosaic. Since prosaic AI is what we're presently worried about, any technique to align DNNs is prosaic AI alignment, by Paul's definition.

My understanding is that AI labs, particularly Anthropic, are interested in moving from human-supervised techniques to AI-supervised techniques, as part of an overall agenda towards indefinitely scalable oversight via AI self-supervision.  I don't think Anthropic considers RLAIF an alignment endpoint itself. 

To the people downvoting/disagreeing, tell me:

Where does your belief regarding suffering come from?

Does it come from ordinary human values?

  • great, CEV will handle it.

Does it come from your own personal unique values?

  • the rest of humanity has no obligation to go along with that

Does it come from pure logic that the rest of us would realize if we were smart enough?

  • great, CEV will handle it.

Is it just a brute fact that suffering of all entities whatsoever is bad, regardless of anyone's views? And furthermore, you have special insight into this, not from your own personal values, or from logic,  but...from something else?

  • then how are you not a religion? where is it coming from?

I saw this writeup only now (linked from the 2023 survey). Thanks for writing it up. I especially liked your comment on row 99.

Some thoughts on your notes on income:

  • With the low sample size a single wealthy person can break the average - and the survey has 3% "independently wealthy".
  • To compare with the US mean you need to weight by population composition. The respondents tend to be young (students?) and wouldn't have "average" income.
  • The high average IQ of responders means higher average income. It correlates, right? 

The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.

I agree this is a valid critique. Here's one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.
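
A rough sketch of this kind of control experiment (assuming a standard ReLU autoencoder with an L1 sparsity penalty; the layer index, hyperparameters, and module path below are illustrative assumptions, not the actual code used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def randomize_(m: torch.nn.Module) -> torch.nn.Module:
    # Control condition: keep the architecture, discard the learned weights.
    with torch.no_grad():
        for p in m.parameters():
            p.normal_(mean=0.0, std=0.02)
    return m

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, n_features)
        self.dec = torch.nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f

def collect_mlp_acts(m, texts, layer=3):
    # Module path assumes the GPT-NeoX implementation in transformers.
    acts = []
    hook = m.gpt_neox.layers[layer].mlp.register_forward_hook(
        lambda _mod, _inp, out: acts.append(out.detach().reshape(-1, out.shape[-1]))
    )
    with torch.no_grad():
        for t in texts:
            m(**tokenizer(t, return_tensors="pt"))
    hook.remove()
    return torch.cat(acts)

def train_sae(acts, n_features=2048, l1=1e-3, steps=1000, lr=1e-3):
    sae = SparseAutoencoder(acts.shape[-1], n_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(len(acts), (256,))]
        x_hat, f = sae(batch)
        loss = torch.nn.functional.mse_loss(x_hat, batch) + l1 * f.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

# Same real data in both conditions; only the model weights differ.
# texts = [...]
# sae_real   = train_sae(collect_mlp_acts(model, texts))
# sae_random = train_sae(collect_mlp_acts(randomize_(model), texts))
```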

My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m's MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.


The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the "man" token and no others). Most of the features didn't look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns. (The features for dictionaries trained on the non-random network looked much better.)

We also did a variant of this experiment where we randomized Pythia-70m's parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens "make," "makes," and "making").

Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.


I agree that a reasonable intuition for what SAEs do is: identify "basic clusters" in NN activations (basic in the sense that you allow compositionality, i.e. you don't try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:

  1. your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
  2.  because of correlations present in your underlying data (the thing that you seem to be worried about).

Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:

  • Most clusters in NN activations on real data might be of the first type
    • This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them (to the extent that these concepts were useful for getting low loss, which they typically will be if your model is trained on next-token prediction (a task which incentivizes you to model all the correlations)).
  • Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
    • E.g. If there are five clusters with centroids w, x, y, z, and y + z and your SAE can only learn 2 of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free).

Yeah, I guess I view Rugby and American football as being essentially combat sports. This may be worth clarifying in the post, but no one who read it and then found out "oh this person actually did Rugby not wrestling" would be particularly surprised.

Still, this is somewhat an illustration of the general problem: there are often many adjacent and some non-adjacent alternative explanations.

You correctly mention that not all AI risk is solved by AI control being easy, because AI misuse can still be a huge factor

It’s odd that you understood me as talking about misuse. Well, I guess I’m not sure how you’re using the term “misuse”. If Person X doesn’t follow best practices when training an AI, and they wind up creating an out-of-control misaligned AI that eventually causes human extinction, and if Person X didn’t want human extinction (as most people don’t), then I wouldn’t call that “misuse”. Would you? I would call it a “catastrophic accident” or something like that. I did mention in the OP that some people think human extinction is perfectly fine, and I guess if Person X is one of those people, then it would be misuse. So I suppose I brought up both accidents and misuse.

Misuse focused policy probably looks less technical, and more normal, for example Know Your Customer laws or hashing could be extremely important if we're worried about misuse of AI for say bioterrorism.

People who I think are highly prone to not following best practices to keep AI under control, even if such best practices exist, include people like Yann LeCun, Larry Page, Rich Sutton, and Jürgen Schmidhuber, who are either opposed to AI alignment on principle, or are so bad at thinking about the topic of AI x-risk that they spout complete and utter nonsense. (example). That’s not a problem solvable by Know Your Customer laws, right? These people (and many more) are not the customers, they are among the ones doing state-of-the-art AI R&D.

In general, the more people are technically capable of making an out-of-control AI agent, the more likely that one of them actually will, even if best practices exist to keep AI under control. People like to experiment with new approaches, etc., right? And I expect the number of such people to go up and up, as algorithms improve etc. See here.

If KYC laws aren’t the answer, what is? I don’t know. I’m not advocating for any particular policy here.

So what I'm trying to get at here is essentially the question "how much can we offload the complexity of values to the learning system" rather than, say, directly specifying it via the genome. In essence, I'm focused on the a priori complexity of human values and the human innate reward function, since this variable often is a key disagreement between optimists and pessimists on controlling AI, and in particular it especially matters for how likely deceptive alignment is to occur relative to actual alignment, which is both a huge and popular threat model.

Re the reward function, the prior discussion also sort of applies here: if it is learnable or otherwise simple to hardcode, then other functions probably will work just as well without relying on the human reward function. And if it's outright learnable by AI, then it's almost certainly going to be learned before anything else (conditional on the reward function being simple), especially before the deceptively aligned algorithm if the aligned one is simpler; and if not, then it's only slightly more complex, so we can easily provide the small amount of data needed to distinguish between the 2 algorithms. That is how I view the situation with the human innate reward function.

My crux is that this statement is probably false, conditional on it either being very simple to hardcode (as in a few lines, say) or being learnable by the self-learning algorithm/within-lifetime RL/online learning algorithms you consider:

"The human innate reward function is absolutely critical to human prosocial behavior."

Putting it another way, I deny the specialness of the innate reward function in humans being the main driver, because most of that reward function has to be learned, which could be replicated by brain-like AGI/Model-Based RL via online learning. Thus most of the complexity does not matter, and that also probably implies that most of the complex prosocial behavior is fundamentally replicable by a brain-like AGI/Model-Based RL agent without having to have the human innate reward function.

The innate function obviously has some things hard-coded a priori, and there is some complexity in the reward function, but not nearly as much as a lot of people think, since IMO a lot of the reward function/human prosocial values are fundamentally learned and almost certainly replicable by a Brain-like AGI paradigm, even if it didn't use the exact innate reward function the human uses.

Some other generalized updates I made were these, this is quoted from a discord I'm in, credit to TurnTrout for noticing this:

An update of "guess simple functions of sense data can entrain this complicated edifice of human value, along with cultural information" and the update of "alignment to human values is doable by a simple function so it's probably doable by lots of other functions",

as well as contextualized updates like "it was probably easy for evolution to find these circuits, which is evidence you don't need that much precision in your reward specification to get roughly reasonable outputs".

Is there a reason you’re using 3 hour time control? I’m guessing you’ve thought about this more than I have, but at first glance, it feels to me like this could be done pretty well with EG 60-min or even 20-min time control.

I’d guess that having 4-6 games that last 20-30 mins is better than having 1 game that lasts 2 hours.

(Maybe I’m underestimating how much time it takes for the players to give/receive advice. And ofc there are questions about the actual situations with AGI that we’re concerned about— EG to what extent do we expect time pressure to be a relevant factor when humans are trying to evaluate arguments from AIs?)

It’s called “responsible scaling”. In its own name, it conveys the idea that not further scaling those systems as a risk mitigation measure is not an option.

That seems like an uncharitable reading of "responsible scaling." Strictly speaking, the only thing that name implies is that it is possible to scale responsibly. It could be more charitably interpreted as "we will only scale when it is responsible to do so." Regardless of whether Anthropic is getting the criteria for "responsibility" right, it does seem like their RSP leaves open the possibility of not scaling. 

(Didn't consult Nora on this; I speak for myself)


I only briefly skimmed this response, and will respond even more briefly.

Re "Re: "AIs are white boxes""

You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling stuff than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally. 
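To make this concrete, here is a toy sketch of my own (nothing from the original post; the quadratic objective, the matrix A, and all numbers are made up purely for illustration). Both optimizers minimize the same loss; the white-box one exploits the system's internals via exact gradients, while the black-box one only sees outputs, and neither requires the user to "understand" what the internals mean:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))      # illustrative internals of the "system"
b = rng.normal(size=20)

def loss(x):
    # black-box view: all you get out is a number
    return float(np.sum((A @ x - b) ** 2))

def grad(x):
    # white-box view: reading the internals gives exact gradients
    return 2 * A.T @ (A @ x - b)

# White-box control: plain gradient descent
x = np.zeros(20)
for _ in range(2000):
    x -= 1e-3 * grad(x)

# Black-box control: random search with a comparable evaluation budget
y = np.zeros(20)
best = loss(y)
for _ in range(2000):
    cand = y + 0.1 * rng.normal(size=20)
    if loss(cand) < best:
        y, best = cand, loss(cand)

print("gradient descent loss:", loss(x))   # typically far lower
print("random search loss:   ", best)

Under the same budget, the gradient-based optimizer typically ends up with a much lower loss, which is the sense in which white-box optimization gives more control, independent of any interpretability of A.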

Re: "Re: "Black box methods are sufficient"" (and the other stuff about evolution)

Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.
 

Trying to draw inferences about ML from bio evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Regardless of the fact that they can both be called "optimization processes", they're completely different things with different causal structures, and crucially, those differences in causal structure explain their different outcomes. There's thus no valid inference from "X happened in biological evolution" to "X will eventually happen in ML", because X happening in biological evolution is explained by evolution-specific details that don't appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).

Re: "Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care""

This wasn't the point we were making in that section at all. We were arguing about concept learning order and the ease of internalizing human values versus other features for basing decisions on. We were arguing that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning capacity ladder, you end up with an AI that's aligned before you end up with one that's so capable it can destroy the entirety of human civilization by itself. 
 

Re "Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are."

I think you badly misunderstood the post (e.g., multiple times assuming we're making an argument we're not, based on shallow pattern matching of the words used: interpreting "whitebox" as meaning mech interp and "values are easy to learn" as "it will know human values"), and I wish you'd either take the time to actually read / engage with the post in sufficient depth to not make these sorts of mistakes, or not engage at all (or at least not be so rude when you do it). 
 

(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO): 

As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you've previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I'll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.


Re: "Overall take: unimpressed."

I'm more frustrated and annoyed than "unimpressed". But I also did not find this response impressive. 

We have Wildeford's Third Law: "Most >10 year forecasts are technically also AI forecasts".

We need a law like "Most statements about the value of EA are technically also AI forecasts".

I've confused you with people who deny that a misaligned AGI is even capable of killing most humans. Glad to be wrong about you.

But I am not saying that the doom is unlikely given superintelligence and misalignment, I am saying the argument that gets there -- superintelligence + misalignment -- is highly conjunctive. The final step, the execution as it were, is not highly conjunctive.

But I don't agree that it's highly conjunctive.

  • If AGI is possible, then its superintelligence is a given. Superintelligence isn't given only if AGI stops at the human level of intelligence + can't think much faster than humans + can't integrate abilities of narrow AIs naturally. (I.e. if AGI is basically just a simulation of a human and has no natural advantages.) I think most people don't believe in such AGI.
  • I don't think misalignment is highly conjunctive.

I agree that hard takeoff is highly conjunctive, but why is "superintelligence + misalignment" highly conjunctive?

I think it's needed for the "likely". Slow takeoff gives humans more time to notice and fix problems, so the likelihood of bad outcomes goes down. Wasn't that obvious?

If AGI is AGI, there won't be any problems to notice. That's why I think probability doesn't decrease enough.

...

I hope that Alignment is much easier to solve than it seems. But I'm not sure (a) how much weight to put into my own opinion and (b) how much my probability of being right decreases the risk.

Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).

I think it's worth reflecting on what type of evidence would be sufficient to convince you that we're actually making progress on the "caring" bit of alignment and not merely the "understanding" bit. Because I currently don't see what type of evidence you'd accept beyond near-perfect mechanistic interpretability.

I think current LLMs demonstrate a lot more than mere understanding of human values; they seem to actually 'want' to do things for you, in a rudimentary behavioral sense. When I ask GPT-4 to do some task for me, it's not just demonstrating an understanding of the task: it's actually performing actions in the real world that result in the task being completed. I think it's totally reasonable, prima facie, to admit this as evidence that we are making some success at getting AIs to "care" about doing tasks for users.

It's not extremely strong evidence, because future AIs could be way harder to align, maybe there's ultimately no coherent sense in which GPT-4 "cares" about things, and perhaps GPT-4 is somehow just "playing the training game" despite seemingly having limited situational awareness. 

But I think it's valid evidence nonetheless, and I think it's wrong to round this datum off to a mere demonstration of "understanding". 

We typically would not place such a high standard on other humans. For example, if a stranger helped you in your time of need, you might reasonably infer that the stranger cares about you to some extent, not merely that they "understand" how to care about you, or that they are merely helping people out of a desire to appear benevolent as part of a long-term strategy to obtain power. You may not be fully convinced they really care about you because of a single incident, but surely it should move your credence somewhat. And further observations could move your credence further still.

Alternative explanations of aligned behavior we see are always logically possible, and it's good to try to get a more mechanistic understanding of what's going on before we confidently declare that alignment has been solved. But behavioral evidence is still meaningful evidence for AI alignment, just as it is for humans.

It's a bit ironic that the app idea doesn't work in practice for the same reasons that communism doesn't work in practice.

I agree with some of this, but I'd say Story 1 applies only very weakly, and that the majority/supermajority of value learning is online, for example via the self-learning/within-lifetime RL algorithms you describe, without relying on the prior. In essence, I agree with the claim that the genes need to impose a prior, which prevents pure blank-slatism from working. I disagree with the claim that this means that genetics need to impose a very strong prior without relying on the self-learning algorithms you describe for capabilities.

You keep talking about “prior” but not mentioning “reward function”. I’m not sure why. For human children, do you think that there isn’t a reward function? Or there is a reward function but it’s not important? Or do you take the word “prior” to include reward function as a special case?

If it’s the latter, then I dispute that this is an appropriate use of the word “prior”. For example, you can train AlphaZero to be superhumanly skilled at winning at Go, or if you flip the reward function then you’ll train AlphaZero to be superhumanly skilled at losing at Go. The behavior is wildly different, but is the “prior” different? I would say no. It’s the same neural net architecture, with the same initialization and same regularization. After 0 bits of training data, the behavior is identical in each case. So we should say it’s the same “prior”, right?
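For what it's worth, here is a schematic toy version of that point that I find useful (my own construction, not AlphaZero; the three-armed bandit, the reward values, and the REINFORCE-style update are all illustrative assumptions). The two runs share an identical architecture and initialization, i.e. the same "prior", and differ only in the sign of the reward, yet the trained behavior is opposite:

import numpy as np

rng = np.random.default_rng(0)
init_logits = rng.normal(size=3)        # the shared "prior" over 3 actions
reward = np.array([1.0, 0.0, -1.0])     # made-up per-action reward

def train(rewards, steps=5000, lr=0.1):
    logits = init_logits.copy()         # identical starting point for both runs
    for _ in range(steps):
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(3, p=probs)
        # simple REINFORCE-style update: reinforce actions in proportion to reward
        logits += lr * rewards[a] * (np.eye(3)[a] - probs)
    return int(logits.argmax())

print("trained on reward:  picks action", train(reward))    # converges to action 0
print("trained on -reward: picks action", train(-reward))   # converges to action 2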

(As I mentioned in the OP, on my models, there is a human innate reward function, and it’s absolutely critical to human prosocial behavior, and unfortunately nobody knows what that reward function is.)

I agree. But telling them to read the sequences is still pointless.

it's a workaround for a broken downvote system. "don't like my post, legit. please be aware the downvote system will ban me if I get heavily downvoted".

In what way is rugby wildly different from combat :-)

Answer by Viliam, Dec 02, 2023

Shortly:

  1. Post the actual text. Not a link to a link to a download site.
  2. Make your idea clear. No joking, no... whatever. The inferential distance is too large, the chance to be understood is already too low. Don't make it worse.
  3. Post one idea at a time. If your article contains dozen ideas and I disagree with all of them, I am just going to click the downvote button without an explanation. If your article contains one idea and I disagree with it, I may post a reason why I disagree.
  4. Don't post many articles at the same time. Among other things, it sends a clear signal that you are not listening to feedback.
  5. This actually should be point zero: Consider the possibility that you might actually be wrong. (From my perspective, this possibility is very high.)

A lot of the people around me (e.g. who I speak to ~weekly) seem to be sensitive to both new news and new insights, adapting both their priorities and their level of optimism[1]. I think you're right about some people. I don't know what 'lots of alignment folk' means, and I've not considered the topic of other-people's-update-rates-and-biases much.


For me, most changes route via governance.

I have made mainly very positive updates on governance in the last ~year, in part from public things and in part from private interactions.

I've also made negative (evidential) updates based on the recent OpenAI kerfuffle (more weak evidence that Sam+OpenAI is misaligned; more evidence that org oversight doesn't work well), though I think the causal fallout remains TBC.

Seemingly-mindkilled discourse on East-West competition provided me some negative updates, but recent signs of life from govts at e.g. the UK Safety Summit have undone those for now, maybe even going the other way.

I've adapted my own priorities in light of all of these (and I think this adaptation is much more important than what my P(doom) does).


Besides their second-order impact on the Overton window etc., I have made very few object-level updates based on public research/deployment since 2020. Nothing has been especially surprising.

From deeper study and personal insights, I've made some negative updates based on a better appreciation of multi-agent challenges since 2021 when I started to think they were neglected.

I could say other stuff about personal research/insights but they mainly change what I do/prioritise/say, not how pessimistic I am.


  1. I've often thought that P(doom) is basically a distraction and what matters is how new news and insights affect your priorities. Of course, nevertheless, I presumably have a (revealed) P(doom) with some level of resolution. ↩︎

In the UK, I think the most common assumption for cauliflower ear would be playing rugby, rather than a combat sport.

No idea if that's the statistically correct inference from seeing someone with the condition.

Completed the survey. I liked the additional questions you added, and the overall work put into this. Thanks!

I think you are completely missing the entire point of the AI alignment problem.

The problem is how to make the AI recognize good from evil. Not whether upon recognizing good, the AI should print "good" to output, or smile, or clap its hands. Either reaction is equally okay, and can be improved later. The important part is that the AI does not print "good" / smile / clap its hands when it figures out a course of action which would, as a side effect, destroy humankind, or do something otherwise horrible (the problem is to define what "otherwise horrible" exactly means). Actually it is more complicated than this, but you are already missing the very basics.

Well what's the appropriate way to act in the face of the fact that I AM sure I am right?

  1. Change your beliefs
  2. Convince literally one specific other person that you're right and your quest is important, and have them help translate for a broader audience

I know there's a tradeoff here with driving traffic to your Substack

Why not post the contents of the papers directly on Substack? They would only be one click away from here, and would not compete against Substack.

From my perspective, academia.edu and Substack are equally respectable (that is, not at all).

I agree that my suggestion was not especially helpful.

If someone is too wrong, and explicitly refuses to update on feedback, it may be impossible to give them a short condensed argument.

(If someone said that Jesus was a space lizard from another galaxy who came to China 10000 years ago, and then he publicly declared that he doesn't actually care whether God actually exists or not... which specific chapter of the Bible would you recommend him to read to make him understand that he is not a good fit for a Christian web forum? Merely using the "Jesus" keyword is not enough, if everything substantial is different.)

Thanks, Aaron! That's helpful to hear. I think "forgetting" is a good candidate explanation because scholars answered that question right after completing Alignment 201, which is designed for breadth. Especially given the expedited pace of the course, I wouldn't be surprised if people forgot a decent chunk of A201 material over the next couple months. Maybe for those two scholars, forgetting some A201 content outweighed the other sources of breadth they were afforded, like seminars, networking, etc.

Do we still have the ancient tradition of upvoting survey completion?

Answer by Viliam, Dec 02, 2023

I am not sure how to proceed with my alignment work in the absence of a quorum of people willing to upvote the most valuable and important parts of my research agenda.

I think the answer is: don't.

Going by the feedback, your research agenda seems valuable and important to you, but not to the LessWrong community. So there is no reason why LessWrong should host your Sequence. (I would tell you to put it on Substack instead, but you already did.)

That's fair to 'aspire to a higher standard,' and I'll avoid adding screenshots of text in the future.

However, I must say, the 'higher standard' and commitment to remain serious even for a shortform post kind of turns me off from posting on LessWrong in the first place. If this is the culture that people here want, then that's fine and I won't tell this website to change, but I personally don't like what I find to be over-seriousness.

I do understand the point about sharing text to make it easier for disabled people (I just don't always think of it).

In my neighborhood there are some similar activities.

About once a year there is a "Money-Free Zone", which means that someone rented a big room for a day (for example a gym at a school, during a weekend) and put some tables there. People who want to donate stuff come there, give the stuff to the organizers, the organizers sort it out and place it on the tables. Then everyone is free to take whatever they want. At the end of the day, the organizers put the remaining stuff into bags and offer it to some charities, and I suppose whatever is rejected ultimately gets thrown out.

This requires some money and work, but only on the side of the organizers. For everyone else it is free. For people like me it is actually a good opportunity to get rid of some things I no longer need, so I usually give about as much as I take. The event is open for everyone, and giving is purely optional.

The problem is that this works okay as long as giving and taking is at least somewhat balanced. I do not need to take as much as I give, but if I take literally nothing, it removes a large part of the incentive to come the next time. Most of the time it is okay -- I suppose because most poor people do not get the memo? though that explanation sounds a bit weird -- but I have heard that at some places the event was overrun by hordes of poor people (sometimes poor smelly people) which was a bad experience for the donors, so the next year the event was organized at a different location and was not advertised publicly; it was still open for everyone who came, but you needed to be lucky and get the info through the grapevine.

We also have a neighborhood group on Facebook, and related to it there is a mutual donation group, that only the members of the neighborhood group can join. If you want to get rid of something, you post a photo and a description, and the first person who replies can take it.

There is also a website for people selling to each other, where you can also "sell" for a price of 0 €.

Compared to the American versions, as described on Wikipedia, it seems to me that our local version is much less ideological. Like, the Facebook group is not ideological at all, the spirit is "neighbors offering stuff to each other"; at the selling website the spirit is "this is so cheap that I am actually not even asking money for it". Only the Money-Free Zone has some ideological connotations in the title (it may appeal to people who believe that money is a bad idea in general), but the activity itself is very factual: you bring stuff, you take stuff, no one is giving you lectures on anything. I suspect that whatever your motivation for organizing such events is, not pushing your ideology on the participants makes it a better experience.

(I assume that the main effect of a "Buy Nothing Day" is that people buy that stuff on the previous or the next day instead, so the weekly sales remain the same.)

One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.

When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).

It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods). I do expect we can improve; we're very far from the 99% standard. But the way we improve won't be by "drilling into the residual"; that has been tried and is insufficient. EDIT: Possibly by "drill into the residual" you mean "understand why the methods don't work and then improve them" -- if so I agree with that but also think this is what mech interp researchers want to do.

(Why am I still optimistic about interpretability? I'm not convinced that the 99% standard is required for downstream impact -- though I am pretty pessimistic about the "enumerative safety" story of impact, basically for the same reasons as Buck and Ryan afaict.)
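(For readers who haven't seen the metric above: a minimal sketch of one common way "fraction of loss recovered" is operationalized when evaluating a circuit against the full model. The function name and the numbers are placeholders I made up, not any particular paper's or library's API.)

def fraction_of_loss_recovered(loss_full_model, loss_circuit, loss_ablated_baseline):
    # 1.0 means the circuit matches the full model's loss;
    # 0.0 means it does no better than the fully ablated baseline.
    return (loss_ablated_baseline - loss_circuit) / (loss_ablated_baseline - loss_full_model)

# Illustrative numbers only: a circuit recovering most, but well short of 99%, of the loss.
print(fraction_of_loss_recovered(2.90, 3.05, 4.10))  # ~0.875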

There seems to be a bit of pushback against "postmortem" and our team is ambivalent, so I changed to "retrospective."

I think I see where you're coming from on this, but there are a few things to consider:

First, a lot of your criticisms apply most strongly to my own particular idiosyncratic method, and when evaluating it solely as an effective altruism strategy. In fact, I chose the method I did largely as a variety of conscientious objection, not as effective altruism. My post here highlighted the possibilities of tax resistance as an effective altruism strategy, but my own motives for my resistance are more complicated and I did not choose my own method of resistance to optimize its charitable donation possibilities. If you judge it by that standard, it will admittedly look pretty weak. But it's also possible to choose tax resistance methods differently from how I have done, in a way that prioritizes effective altruism over conscientious objection, if your motives are different from mine.

Second, I think you exaggerate the precariousness of my position. I'm not impoverished. I'm actually doing pretty well. I put aside something like 40% of my income for retirement, and every year I put roughly the equivalent of my health insurance deductible into a Health Savings Account in case disaster (or distracted driver) strikes. I make about the median annual income for an individual in the U.S., and have saved up more than the median retirement savings for someone in my age bracket. I'm not "brutally curtailed" or living in "self-imposed poverty". I'm a reasonably well-off person living in the lap of luxury here in California and enjoying the fruits of the most fabulously prosperous time our species has yet experienced. I can't imagine feeling deprived like this.

Third, you underestimate the charitable impact of my resistance if you only include the $5k/year or so that I donate and ignore the hundreds of hours of volunteer work (not, perhaps, effective-altruistically optimized, but nonetheless good) my particular technique has helped me to put in.

Fourth, your argument that "if you wanted to fix any of this, you... couldn't pay off your existing $90k+ liability" is incorrect. If for some reason I changed my mind about all this and wanted to wipe the slate clean, if I were too poor to just pay the full amount, the IRS is like many debt collectors in this regard: it would rather get something than fail to get everything, so it's willing to bargain. It will ask you what you can afford (demanding that you fess up about your income and assets) and then come up with some figure that doesn't totally bankrupt you, telling you that you can eliminate your tax debt entirely if you can come up with this lower sum. It's called the Offer in Compromise program (https://www.irs.gov/payments/offer-in-compromise).

EDIT: This was supposed to be a reply to the answer by Valdes, but for some reason LW keeps posting it as a separate answer. No idea why.

Most people are incompetent, and the competent ones are usually busy. So unless you pay the market price (quite high) or the project is super exciting (to someone other than you), you will get a crappy app. It's not just the code, it can be a crappy design, or utter lack of empathy with the user.

Recently I have been using an app to record my daily medicine usage. The idea is that every day I take a pill, and I confirm that "yes, today I took a pill at X o'clock". (Or: "today I didn't". Or, if I forget to enter either piece of information today, tomorrow I enter it also for the previous days.) Then it uploads the information to a server. How difficult can this get?

  • First I start the application, and I need to enter a password. Then it takes about a minute to authenticate me on a server. I appreciate the concern for privacy, but why the fuck can't the password just be verified locally?
  • Then I need to choose whether I want to report the pill usage, or read the tutorial. I have already read the tutorial, why can't it remember this simple fact and skip the screen?
  • The next screen tells me that I have reported the pill usage for yesterday, but not for today. Thank you, Captain Obvious, that's like 95% of situations when I use the app. Why can't you just skip this screen in such case, and only display it when something unexpected happens, such as I have already reported the pill usage for today, or I forgot to report it yesterday?
  • Then it asks me whether I took the pill today, and I need to check the "yes" or "no" option, and then click Next.
  • Then there is a screen that tells me to select time. I need to click a clock icon, it displays a modal dialog where I adjust the hours and minutes (by clicking small "+" and "-" buttons below them; if I click outside the small buttons, the modal dialog closes and I need to click the clock icon and enter the time again). Why couldn't these two screens plus the modal dialog be replaced by one screen that displays the hours and minutes with the "+" and "-" buttons, plus another button "I didn't take the pill today"?
  • Then there is a screen telling me to review the information I entered, click the "ok" checkbox, and then click Next. Except it doesn't show the entered information, so the only way to review would be to click the "Back" button and check on the previous screen.
  • Then it takes another minute to upload the information to the server, and then I can finally close the app.

I mean, it could be worse, and you can get used to it, but it could also be way more convenient. But it is not, quite predictably. Crap is what you get by default, and there is no market mechanism to select for a higher quality app.
