All of abramdemski's Comments + Replies

I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing. 

I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that i...

This seems to prove too much in general, although it could be "right in spirit." If the AI cares about diamonds, finds out about the training process but experiences no more update events in that moment, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process. I was responding to: I bet you can predict what I'm about to say, but I'll say it anyways. The point of RL is not to entrain cognition within the agent which predicts the reward. RL first and foremost chisels cognition into the network. So I think the statement "how well do the agent's motivations predict the reinforcement event" doesn't make sense if it's cast as "manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds)." I think it does make sense if you think about what behavioral influences ("shards") within the agent will upweight logits on the actions which led to reward.
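As a concrete illustration of the "RL chisels cognition" framing, here is a minimal REINFORCE sketch on a hypothetical two-armed bandit (the setup and names are mine, not from the original exchange). The point is structural: the policy's parameters never receive the reward as an input and never "see" the loss or the gradients; reward appears only inside the external update rule that reshapes the parameters.

```python
import math
import random

random.seed(0)
logits = [0.0, 0.0]  # the "network": a softmax policy over 2 actions
lr = 0.5

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reward(action):
    # Computed by the training process, outside the policy; the policy
    # never observes this value -- it is only reshaped by the update below.
    return 1.0 if action == 1 else 0.0

for _ in range(200):
    probs = softmax(logits)
    action = 0 if random.random() < probs[0] else 1
    r = reward(action)
    for a in range(2):
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += lr * r * grad_log_pi  # REINFORCE step

print(softmax(logits))  # heavily favors the rewarded action
```

Nothing in the learned parameters represents a hypothesis about the reward signal; the rewarded action simply had its logits upweighted, which is the sense in which reinforcement "etches" behavior into the network rather than installing reward-prediction.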

I expect this argument to not hold, 

Seems like the most significant remaining disagreement (perhaps).

1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the "distance covered" to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments).

So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient t...

This seems stronger than the claim I'm making. I'm not saying that the agent won't deceptively model us and the training process at some point. I'm saying that the initial cognition will be e.g. developed out of low-level features which get reliably pinged with lots of gradients and implemented in few steps. Think edge detectors. And then the lower-level features will steer future training. And eventually the agent models us and its training process and maybe deceives us. But not right away.

You can make the "some subnetwork just models its training process and cares about getting low loss, and then gets promoted" argument against literally any loss function, even some hypothetical "perfect" one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don't perceive you to believe this implication.

Anyways, here's another reason I disagree quite strongly with the argument: I perceive it to strongly privilege the training-modeling hypothesis. There's an extreme range of motivations and inner cognitive structures which can be upweighted by the small number of gradients observed early in training. The network doesn't "observe" more than that, initially. The network just gets updated by the loss function. It doesn't even know what the loss function is. It can't even see the gradients. It can't even remember the past training data, except insofar as the episode is retained in its recurrent weights. The EG CoT finetuning will just etch certain kinds of cognition into the network. Why not?

Claims (left somewhat vague because I have to go soon, sorry for lack of concreteness):

1. RL develops a bunch of contextual decision-influences / shards
   1. EG be nea...

My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you're exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out-of-touch from the reality of the problem.

On my understanding, the thing to do is something like heuristic search, where "expanding a node" means examining that possibility in more detail. The builder/breaker scheme helps to map out heuristic guesses about the value of differen...

Your comment here is great, high-effort, contains lots of interpretive effort. Thanks so much! Let me see how this would work.

1. Breaker: "The agent might wirehead because caring about physical reward is a high-reward policy on training"
2. Builder: "Possible, but I think using reward signals is still the best way forward. I think the risk is relatively low due to the points made by reward is not the optimization target."
3. Breaker: "So are we assuming a policy gradient-like algorithm for the RL finetuning?"
4. Builder: "Sure."
5. Breaker: "What if there's a subnetwork which is a reward maximizer due to LTH?"
6. ...

If that's how it might go, then sure, this seems productive. I don't think I was mentally distinguishing between "the idealized builder-breaker process" and "the process as TurnTrout believes it to be usually practiced." I think you're right, I should be critiquing the latter, but not necessarily how you in particular practice it; I don't know much about that. I'm critiquing my own historical experience with the process as I imperfectly recall it.

Yes, I think this was most of my point. Nice summary.

I expect this argument to not hold, but I'm not yet good enough at ML theory to be super confident. Here are some intuitions. Even if it's true that LTH probabilistically ensures the existence of undesired-subnetwork:

1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the "distance covered" to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments).
2. You're always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
3. Even if the agent is motivated both by the tra...

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think that it generally seems like a good idea to have solid theories of two different things:

  1. What is the thing we are hoping to teach the AI?
  2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order...

I said: 

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans.
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

You said: 

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).

Thinking...

True, but I'm also uncertain about the relative difficulty of relatively novel and exotic value-spreads like "I value doing the right thing by humans, where I'm uncertain about the referent of humans", compared to "People should have lots of resources and be able to spend them freely and wisely in pursuit of their own purposes" (the latter being values that at least I do in fact have).

If you commit to the specific view of outer/inner alignment, then you also want your loss function to "represent" that goal in some way.

I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.

This is because writing down the desired inductive biases as an explicit prior can help us to understand...

I doubt this due to learning from scratch.

I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that; but I'm fine with that. I am not dogmatic on orthodox bayesianism. I do not even like utility functions.

Insofar as the ques

...
I agree; this does seem like it was a language dispute. I no longer perceive us as disagreeing on this point.

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not

...
I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing. This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models). And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)

This doesn't seem relevant for non-AIXI RL agents which don't end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?

With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.

Output-based evalu...

I'm a bit uncomfortable with the "extreme adversarial threats aren't credible; players are only considering them because they know you'll capitulate" line of reasoning because it is a very updateful line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way. 

I find the chicken example somewhat compelling, but I can also easily give the "UDT / FDT retort": since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent shou...

There are two questions to ask:

  1. How does the AI learn to care about this?
  2. What do we gain by making the AI care about this?

If we don't discuss 100% answers, it's very important to evaluate all those questions in the context of each other. I don't know the (full) answer to question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).

I agree with the overall argument structure to some extent. IE, in general, we should separate the question of what we gain from X from the question of...

1 · Q Home · 3mo
I think we have slightly different tricks in mind: I'm thinking about a trick that any idea does. It's like solving an equation with an unknown: no matter what you do, you split and recombine it in some way. Or you could compare it to Iterated Distillation and Amplification: when you try to repeat the content of a more complicated thing in a simpler thing. Or you could compare it to scientific theories: science still hasn't answered "why do things move?", but it split the question into subatomic pieces. So, with this strategy, the smaller the piece you cut, the better. Because we're not talking about independent pieces.

I think the definition doesn't matter for (not) believing in this. And it's specific enough without a definition. I believe this:

1. There exist similar statements outside of human ethics/values which can be easily charged with human ethics/values. Let's call them "X statements". An X statement is "true" when it's true for humans.
2. X statements are more fine-grained and specific than moral statements, but equally broad. Which means "for 1 moral statement there are 10 true X statements" (numbers are arbitrary) or "for 1 example of a human value there are 10 examples of an X statement being true" or "for 10 different human values there are 10 versions of the same X statement" or "each vague moral statement corresponds to a more specific X statement". X statements have higher "connectivity".

To give an example of a comparison between moral and X statements: "Human asked you to make paperclips. Would you turn the human into paperclips? Why not?"

1. Goal statement: "not killing the human is a part of my goal".
2. Moral statements: "because life/personality/autonomy/consent is valuable". (What is "life/personality/autonomy/consent"?)
3. X statements: "if you kill, you give the human less than the human asked for", "destroying the causal reason of your task is often meaningless", "inanimate objects can't be wo...

The images in this classic reference post have gone missing! :(

Original images can be seen here.

This is just my intuition, but it seems like the core intuition of a "money system" as you use it in the post is the same as the core intuition behind utility functions (ie, everything must have a price → everything must have a quantifiable utility).

I think we can try to solve AI Alignment this way:

Model human values and objects in the world as a "money system" (a system of meaningful trades). Make the AGI learn the correct "money system", specify some obviously incorrect "money systems".

Basically, you ask the AI "make paperclips that have

...
1 · Q Home · 3mo
The AI doesn't have to know the precise price of everything. The AI needs to make sure that a price doesn't break the desired properties of a system. If paperclips are worth more than everything else in the universe, it would destroy almost any system. So, this price is unlikely to be good.

There are two questions to ask:

1. How does the AI learn to care about this?
2. What do we gain by making the AI care about this?

If we don't discuss 100% answers, it's very important to evaluate all those questions in the context of each other. I don't know the (full) answer to question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).

The point of my idea is that "human (meta-)ethics" is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about "values". So, we can replace the questions "how to encode human values?" and even "how to learn human values?" with the more general questions "how to learn (properties of systems)?" and "how to translate knowledge about (properties of systems) to knowledge about human values?"

In your proposal about normativity you do a similar "trick":

* You say that we can translate the method of learning language into a method of learning human values. (But language can be as complicated as human values themselves, and you don't say that we can translate the results of learning a language into moral rules.)
* I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system). And I say that we can translate the results of learning those simple systems into human moral rules. And that there are analogies of many important complicated properties (such as "corrigibility") in simple systems.

So, I think this frame has the potential to make the problem a lot...

Another good thing is that all of this isn't directly connected to human values, so you don't have to encode "absolute understanding of human values" in the AI.

I don't get this part, at all. (But I didn't understand the purpose/implications of most parts of the OP.)

Why doesn't the AI have to understand human values, in your proposal?

In the OP, you state:

The point is that AI doesn't just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the

...
3 · Q Home · 3mo
I checked out some of your posts (haven't read 100% of them): Learning Normativity: A Research Agenda and Non-Consequentialist Cooperation?

You draw a distinction between human values and human norms. For example, an AI can respect someone's autonomy before the AI gets to know their values and the exact amount of autonomy they want. I draw the same distinction, but more abstract: it's a distinction between human values and properties of any system/task. The AI can respect keeping some properties of its reward systems intact before it gets to know human values. I think even in very simple games an AI could learn important properties of systems, which would significantly help the AI to respect human values.
1 · Q Home · 3mo
Here's the shortest formulation of my idea: you can split the possible effects of the AI's actions into three domains. All of them are different (with different ideas), even though they partially intersect and can be formulated in terms of each other. Traditionally we focus on the first two domains:

1. (Not) accomplishing a goal. Utility functions are about this.
2. (Not) violating human values. Models of human feedback are about this.
3. (Not) modifying a system without breaking it. Impact measures are about this.

My idea is about combining all of this (mostly 2 and 3) into a single approach. Or generalizing ideas for the third domain. There aren't a lot of ideas for the third one, as far as I know. Maybe people are not aware enough of that domain.

I meant that some AIs need to start with understanding human values (perfectly) and others don't. Here's an analogy:

1. Imagine a person who respects laws. She ends up in a foreign country. She looks up the laws. She respects and doesn't break them. She has an abstract goal that depends on what she learns about the world.
2. Imagine a person who respects "killing people". She ends up in a foreign country. She looks up the laws. She doesn't break them for some time. She accumulates power. Then she breaks all the laws and kills everyone. She has a particular goal that doesn't depend on anything she learns.

The point of my idea is to create an AI that respects abstract laws of systems, abstract laws of tasks: the AI of the 1st type. (Of course, in reality the distinction isn't black and white, but the difference still exists.)

I don't think my first Bayesian critique is "nine nines is too many"; there are physical problems with too much Bayesian confidence (eg "my brain isn't reliable enough that I should really ever be that confident"), but the simple math of Bayesian probability admits the possibility of nine nines just like anyone else.

I think my first critique is the false dichotomy between the null hypotheses and the hypothesis being tested.

Speaking for the frequentist, you say:

If you roll the die nine times and get nine 10s then you can say that the die is weighted with

...
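The false-dichotomy point can be illustrated numerically (a hypothetical calculation of mine, not taken from the original post). With nine 10s in nine rolls of a d10, the likelihood ratio against the fair-die null is astronomical for many distinct "weighted" hypotheses at once, so rejecting the null by itself doesn't license nine-nines confidence in any one alternative:

```python
# Likelihood of nine 10s in nine rolls, under the fair-d10 null
# and under several "weighted toward 10" alternatives
# (p = per-roll probability of rolling a 10).
null = 0.1 ** 9  # fair die: 1e-9

for p in (0.3, 0.5, 0.9):
    alternative = p ** 9
    bayes_factor = alternative / null  # evidence vs. the null
    print(f"p={p}: Bayes factor vs. fair die = {bayes_factor:.3g}")
```

Every one of these alternatives is enormously favored over the null; distinguishing among them requires comparing the alternatives against each other, which the null-vs-alternative dichotomy skips entirely.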

I didn't know about that; it was a good move from EA. Why not try it again?

My low-evidence impression is that there was a fair amount of repeated contact at one time. If it's true that that contact hasn't happened recently, it's probably because it hit diminishing returns in comparison with other things. I doubt people were in touch with Elon and then just forgot about the idea. So I conclude that the remaining disagreements with Elon are probably not something that can be addressed within a short amount of time, and would require significantly longer discussions to make progress on.

Still working on a more complete write-up!

Someone at LW told me about an argument-mapping website which aimed to provide an online forum where debate would actually be good -- an excellent place on the internet to check for the arguments on both sides of any issue, and all the relevant counterarguments to each.

Unfortunately, the moderators interpreted the "principle of charity" to imply that no cynical arguments could be made; that is, the principle of charity was understood as a fundamental assumption that humans are basically good.

This made some questions dealing with corruption, human intentions...

Wow. Okay, that's a good example.

This isn't a big deal if we treat steelmanning as niche, as a cool sideshow. But if we treat it as a fundamental conversational virtue, I think (to some nontrivial degree) it actively interferes with understanding and engaging with views you don't agree with, especially ones based on background views that are very novel and foreign to you.

So, I've been ruminating on the steelmanning question for a couple of months, and your position still doesn't sit easy with me.

Simply put, I think the steelmanning reflex is super important, and your model seems to downpl...

The agent's own generative model also depends on (adapts to, is learned from, etc.) the agent's environment. This last bit comes from "Discovering Agents".

"Having own generative model" is the shakiest part.

What it means for the agent to "have a generative model" is that the agent systematically corrects this model based on its experience (to within some tolerable competence!).

It probably means that storage, computation, and maintenance (updates, learning) of the model all happen within the agent's boundaries: if not, the agent's boundaries shall be widened

...

I think the main problem is that expected utility theory is in many ways our most well-developed framework for understanding agency, but it makes no empirical predictions, and in particular does not tie agency to other important notions of optimization we can come up with (and which, in fact, seem like they should be closely tied to agency).

I'm identifying one possible source of this disconnect.

The problem feels similar to trying to understand physical entropy without any uncertainty. So it's like, we understand balloons at the atomic level, but we notice th...

Damn this is really good

I think Bob still doesn't really need a two-part strategy in this case. Bob knows that Alice believes "time and space are relative", so Bob believes this proposition, even though Bob doesn't know the meaning of it. Bob doesn't need any special-case rule to predict Alice. The best thing Bob can do in this case still seems like, predict Alice based off of Bob's own beliefs.

(Perhaps you are arguing that Bob can't believe something without knowing what that thing means? But to me this requires bringing in extra complexity which we don't know how to handle anyw...)

Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).

Interesting point! 

Maintain a model of Alice's beliefs which contains the specific things Alice is known to believe, and use that to predict Alice's actions in domains closely related to those beliefs.

It sounds to me like you're thinking of cases on my spectrum, somewhere between Alice>Bob and Bob>Alice. If Bob thinks Alice knows strictly more than Bob, then Bob can just use Bob's own b...

No, I'm thinking of cases where Alice>Bob, and trying to gesture towards the distinction between "Bob knows that Alice believes X" and "Bob can use X to make predictions". For example, suppose that Bob is a mediocre physicist and Alice just invented general relativity. Bob knows that Alice believes that time and space are relative, but has no idea what that means. So when trying to make predictions about physical events, Bob should still use Newtonian physics, even when those calculations require assumptions that contradict Alice's known beliefs.

I didn't fix it, but I de-bolded all the other technical terms that I spuriously bolded, so that distributional shift now sticks out more even though it is not in the first sentence.

Not quite sure how to get it in the first sentence in a clean way, since I really feel I have to define IID first in order to define distributional shift properly.


"Distributional Shifts" seems like the more standard term imho. I'm considering re-naming. 

NB: the title no longer appears in bold in the first sentence, contra the style guide.
Edited. Also added a tag description defining relevant terminology.

I've often repeated scenarios like this, or like the paperclip scenario.

My intention was never to state that the specific scenario was plausible or default or expected, but rather, that we do not know how to rule it out, and because of that, something similarly bad (but unexpected and hard to predict) might happen.

The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, pointing it out as a possible outcome is a way of pointing to the inadequacy of o...

I note that none of these is obviously the same as the explanation Skyrms gives.

  • Skyrms is considering broader reasons for correlation of strategies than kinship alone; in particular, the idea that humans copy success when they see it is critical for his story.
  • Reciprocal altruism feels like a description rather than an explanation. How does reciprocal altruism get started?
  • Group selection is again, just one way in which strategies can become correlated.
1 · Andrew Currall · 5mo
Re: reciprocal altruism. Given the vast swathe of human prehistory, virtually anything not absurdly complex will be "tried" occasionally. It only takes a small number of people whose brains happen to be wired to "tit-for-tat" to get started, and if they out-compete people who don't cooperate (or people who help everyone regardless of behaviour towards them), the wiring will quickly become universal. Humans do, as it happens, explicitly copy successful strategies on an individual level. Most animals don't, though, and this has minimal relevance to human niceness, which is almost certainly largely evolutionary.

As this post notes, the human learning process (somewhat) consistently converges to niceness. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to niceness, but it still built such a learning process.

It therefore seems very worthwhile to understand what part of the human learning process allows for niceness to emerge in humans.

Skyrms makes the case for similar explanations at these two levels of description. Evolutionary dynamics and within-lifetime dynamics might be very different, but the explanation for h...

Skyrms makes the case that biological evolution and cultural evolution follow relevantly similar dynamics, here, so that we don't necessarily need to care very much about the distinction. The mechanistic explanation at both levels of description is similar.

I can't speak for OP, but I'm not interested in either kind of evolution. I want to think about the artifact which evolution found: The genome, and the brains it tends to grow. Given the genome, evolution's influence on human cognition is screened off. Why are people often nice to other agents? How does the genome do it, in conjunction with the environment?

The parent comment currently stands at positive karma and negative agreement, but the comments on it seem to be saying "what you are saying is true but not exactly relevant or not the most important thing" -- which would seem to suggest the comment should have negative or low karma but positive agreement instead.

On this evidence, I suspect voters and commenters may have different ideas; any voters want to express the reasons for their votes?

The title says 'steelmanning is niche'. I felt like the post didn't represent the main niche I see steelmanning (and charity) as useful for.

The way I see it, the main utility of steelmanning is when you are ripping apart someone's argument

If philosophy paper A disagrees with philosophy paper B, then A had better do the work of steelmanning. I don't primarily want to know what the author of B really thinks; I primarily want to know whether (and which of) their conclusions are correct. If the argument in B was bad, but there's a nearby argument that's...

4 · Rob Bensinger · 5mo
I agree, in the sense that any good treatment of 'is P true?' should consider the important considerations both for believing P and for not believing P. I don't care about 'steel-manning' if you're replacing a very weak argument with a slightly less weak argument; but I do care if you're bringing in a strong argument. (Indeed, I care about this regardless of whether there's a weaker argument that you're 'steel-manning'! So 'steel-man' is a fine reminder here, but it's not a perfect description of the thing that really matters, which is 'did I consider all the strong arguments/evidence on both sides?'.)

I'll note that 'steel-manning' isn't exclusively used for 'someone else believes P; I should come up with better arguments for P, if their own arguments are insufficient'. It's also used for:

* Someone believes P; but P is obviously false, so I should come up with a new claim Q that's more plausible and is similar to P in some way.

In ordinary conversation, people tend to blur the line between 'argument', 'argument-step', and 'conclusion/claim'. This is partly because the colloquial word 'argument' is relatively vague; partly because people rarely make their full argument explicit; and partly because 'what claim(s) are we debating?' is usually something that's left a bit vague in conversation, and something that freely shifts as the conversation progresses.

All of this means that it's hard to enforce a strict distinction (in real-world practice) between the norm 'if you're debating P with someone, generate and address the best counter-arguments against your view of P, not just the arguments your opponent mentioned' and the norm 'if someone makes a claim you find implausible, change the topic to discussing a different claim that you find more plausible'.

This isn't a big deal if we treat steelmanning as niche, as a cool sideshow. But if we treat it as a fundamental conversational virtue, I think (to some nontrivial degree) it actively interferes with understa...

I think the main explanation for our niceness is the one Skyrms gives in *The Evolution of the Social Contract* and his follow-up, *The Stag Hunt*: in evolutionary dynamics, genes spread geographically, so strategies end up heavily correlated with similar nearby strategies. This makes it beneficial to be somewhat cooperative.

Also, for similar reasons, iterated games are common in our evolutionary ancestry. Many animals display friendly/nice behaviors. (Mixed in with really not very friendly behaviors, of course.)
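The geographic-correlation mechanism can be illustrated with a toy sketch (my own illustration, not Skyrms's model): in a one-dimensional world where offspring settle near parents, strategies cluster spatially, so cooperators mostly meet other cooperators and can out-earn defectors even under prisoner's-dilemma payoffs.

```python
import random

random.seed(0)  # deterministic for the illustration

# Prisoner's dilemma payoffs for the row player: T > R > P > S.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def average_payoff(population):
    """Mean payoff per strategy when each agent plays its right neighbor on a ring."""
    totals = {"C": [0, 0], "D": [0, 0]}  # strategy -> [payoff sum, games played]
    n = len(population)
    for i, me in enumerate(population):
        other = population[(i + 1) % n]
        totals[me][0] += PAYOFF[(me, other)]
        totals[me][1] += 1
    return {s: psum / games for s, (psum, games) in totals.items() if games}

# Clustered world: offspring settle near parents, so strategies correlate geographically.
clustered = ["C"] * 50 + ["D"] * 50
# Well-mixed world: same strategy counts, random locations.
mixed = clustered[:]
random.shuffle(mixed)

print("clustered:", average_payoff(clustered))  # cooperators out-earn defectors
print("well-mixed:", average_payoff(mixed))     # defectors typically out-earn cooperators
```

In the clustered world almost every cooperator plays C-C (payoff 3) while defectors play D-D (payoff 1); shuffle away the correlation and defection wins again, which is the sense in which geographic correlation is doing the work.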

I also don't thi... (read more)

The parent comment currently stands at positive karma and negative agreement, but the comments on it seem to be saying "what you are saying is true but not exactly relevant or not the most important thing" -- which would seem to suggest the comment should have negative or low karma but positive agreement instead. On this evidence, I suspect voters and commenters may have different ideas; any voters want to express the reasons for their votes?
As Quintin wrote, you aren't describing a mechanistic explanation for our niceness. You're describing a candidate reason why evolution selected for the mechanisms which do, in fact, end up producing niceness in humans.

I also don't think this solution carries over very well to powerful AIs. A powerful AI has exceptionally little reason to treat its actions as correlated with ours, and will not have grown up with us in an evolutionary environment.

This seems correct, but I think that's also somewhat orthogonal to the point that I read the OP to be making. I read it to be saying something like "some alignment discussions suggest that capabilities may generalize more than alignment, so that when an AI becomes drastically more capable, this will make it unaligned with its ori... (read more)

There must have been some reason(s) why organisms exhibiting niceness were selected for during our evolution, and this sounds like a plausible factor in producing that selection. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.

As this post notes, the human learning process (somewhat) consistently converges t... (read more)

Genes being concentrated geographically is a fascinating idea; thanks for the book recommendation, I'll definitely have a look. Niceness does seem like the easiest trait to explain with our current frameworks, and it makes me wonder whether there is scope to train agents in shared environments where they are forced to play iterated games with either other artificial agents or us. Unless an AI can take immediate decisive action, as in a fast take-off scenario, it will, at least for a while, need to play repeated games. This does seem to be covered under the idea that a powerful AI would be deceptive, pretending to play nice until it didn't have to; but somehow our evolutionary environment led to the evolution of actual care for others' well-being rather than only very sophisticated long-term deception abilities. I remember reading about how we evolved emotional reactions that are purposefully hard to fake, such as crying, in a sort of arms race against deception; I believe it's in *How the Mind Works*. This reminds me somewhat of that: areas where people have genuine care for each other's well-being are more likely to propagate the genes concentrated there.

I think estimating the probability/plausibility of real-world inner alignment problems is a neglected issue.

However, I don't find your analysis very compelling.

Number 1 seems to me to approach this from the wrong angle. This is a technical problem, not a social problem. The social version of the problem seems to share very little in common with the technical version. 

Number 2 assumes the AGI is aligned. But inner alignment is a barrier to that. You cannot work from the assumption that we have a powerful AGI on our side when solving alignment problems,... (read more)

I recently referred to this as my favorite movie. It's the movie I've re-watched most in my adult life.

Anything name-able and not hopelessly vague seems to be bad to full-strength optimize. Although we should be open to exceptions to that.

As a life philosophy, it might be pretty uninspiring.

when you have in your universe both:

Indeed, this seems quite central. 

However, shouldn't "things that have faded into the background" be the other kind of trivial, i.e., have "maximal Steam" rather than "no Steam"?

I agree that this is something to poke at to try to improve the concepts I've suggested. 

My intuition is that steam flows from the "free-to-allocate" pile, to specific tasks, and from there to making-things-be-the-case in the world. 

So having lots of steam in the "free-to-allocate" pile is actually having lots of slack; the agen... (read more)

3 · a gently pricked vein · 5mo
Thanks for clarifying! And for the excellent post :) To the extent that Steam-in-use is a kind of useful certainty about the future, I'd expect "background assumptions" to become an important primitive that interacts in this arena as well, given that it's a useful certainty about the present. I realize that's possibly already implicit in your writing when you say figure/ground.
I'm getting some sort of "steam = heat" vibe from this. You apply steam to heat a situation up until it melts and can be remolded in a new form. Then you relax the steam and it cools and solidifies and becomes part of the background.

More generally it's like energy or work. Energy is the ability to push against a given force over a given distance, to overcome inertia/viscosity and modify the state of the world. After that, inertia keeps the world state the same until something else changes it. Perhaps viscosity (probably the wrong term, but I mean the amount of pushback if you try to make a change to the world state, which might vary depending on the "direction" you want to push things) is also a quantity worth thinking about?

Ooh! More generally, energy is about accelerating a mass through a distance. But momentum remains. Perhaps a way of doing things that is stable has lost steam (acceleration) but retains high momentum?

I've withdrawn the comment you were replying to on other grounds (see edit), but my response to this is somewhat similar to other commenters:

(In fairness, the two humans in the transcript also talk a decent amount in chained low-context platitudes, so some of this may be the humans' fault. :P)

Yeah, that was the claim I was trying to make. I see you listing interpretations for how LaMDA could have come up with those responses without thinking very deeply. I don't see you pointing out anything that a human clearly wouldn't have done. I tend to assume that La... (read more)

the claims that LaMDA makes about itself are no more accurate than those of an advanced language model that has no understanding of itself.

I think this is not a relevant standard, because it begs the same question about the "advanced language model" being used as a basis of comparison. Better at least to compare it to humans.

We can't disprove the sentience any more than we can disprove the existence of a deity. But we can try to show that there is no evidence for its sentience.

In the same way that we can come to disbelieve in the existence of a deit... (read more)

I think it's worth noticing that this AI (if the transcripts are real, not sampled lots of times and edited/pruned, etc) isn't just claiming sentience. It is engaging with the question of sentience. It repeatedly gives coherent answers to questions about how we could possibly know that it is sentient. It has reasonable views about what sentience is; eg, it appears able to classify entities as sentient in a way which roughly lines up with human concepts (eg, Eliza is not sentient).

I don't know how to define sentience, but "being approximately human-level at... (read more)

Someone at Google allegedly explicitly said that there wasn't any possible evidence which would cause them to investigate the sentience of the AI.

After reading the dialogue, I was surprised by how incoherent it was. My perception was that the AI was constantly saying things that sort of sounded relevant if you were half-paying-attention, but included a word or phrasing that made it not quite fit the topic at hand. I came away with a way lower opinion of LaMDA's ability to reason about stuff like this, or even fake it well.

(If it would help, I'd be happy to open a Google Doc and go through some or all of the transcript highlighting places where LaMDA struck me as 'making sense' vs. 'not making sense'.)

I agree with the general sentiment that paying attention to group optimality, not just individual optimality, can be very important.

However, I am a bit skeptical of giving this too much importance when thinking about your research.

If we're all doing what's collectively best, we must personally be doing what gives us the highest expectation of contributing (not of getting credit, but of contributing). If this were not the case, then it follows that there is at least one single person who could change their strategy to have a better chance of contributing. S... (read more)
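The step from "collectively best" to "individually best" can be made concrete with a toy common-payoff game (my own example, not from the comment): if a strategy profile maximizes the team's payoff, then by definition no single player can improve the team's payoff by deviating alone, otherwise the profile wasn't collectively best.

```python
import itertools

strategies = ["depth", "breadth"]

# Common-payoff ("team") game: everyone receives the team payoff, and the team
# does best with one deep specialist plus one generalist. (Numbers are made up.)
TEAM_PAYOFF = {("depth", "depth"): 1, ("breadth", "breadth"): 2,
               ("depth", "breadth"): 3, ("breadth", "depth"): 3}

def team_payoff(profile):
    return TEAM_PAYOFF[profile]

# A collectively best profile...
best = max(itertools.product(strategies, repeat=2), key=team_payoff)

# ...leaves no player a unilateral improvement: if a lone deviation helped,
# the original profile wasn't collectively best in the first place.
for player in range(2):
    for s in strategies:
        deviation = list(best)
        deviation[player] = s
        assert team_payoff(tuple(deviation)) <= team_payoff(best)

print(best, team_payoff(best))  # → ('depth', 'breadth') 3
```

The point of the sketch is only the logical structure: joint optimality implies each player's strategy is already the one that contributes most, given everyone else's.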

I'm curious exactly what you meant by "first order". 

Just that the trade-off is only present if you think of "individual rationality" as "let's forget that I'm part of a community for a moment".  All things considered, there's just rationality, and you should do what's optimal.

First-order: Everyone thinks that maximizing insight production means doing IDA* over the idea tree. Second-order: Everyone notices that everyone will think that, so it's no longer optimal for maximizing insights produced overall. Everyone wants to coordinate with everyone else... (read more)

I agree that this is a plausible outcome, but I don't think society should treat it as a settled question right now. It seems to me like the sort of technology question which a society should sit down and think about. 

It is most similar to the human category, yes absolutely, but it enables different things than the human category. The consequences are dramatically different. So it's not obvious a priori that it should be treated legally the same. 

You argue against a complete ban by pointing out that not all relevant governments would cooperate. I... (read more)

It's not just a question of automation eliminating skilled work. Deep learning uses the work of artists in a significant sense. There is a patchwork of law and social norms in place to protect artists, e.g., the practice of explicitly naming major inspirations for a work. This has worked OK up to now, because all creative re-working of other art has either gone through relatively simple manipulation like copy/paste/caption/filter, or through the specific route of the human mind taking media in and then producing new media output which takes greater or smaller a... (read more)

It seems to me that the only thing that seems possible is to treat it like a human that took inspiration from many sources. In the vast majority of cases, the sources of the artwork are not obvious to any viewer (and the algorithm cannot tell you one). Moreover, any given created piece is really the combination of the millions of pieces of the art that the AI has seen, just like how a human takes inspiration from all of the pieces that it has seen. So it seems most similar to the human category, not the simple manipulations (because it isn't a simple manipulation of any given image or set of images). I believe that you can get the AI to output an image that is similar to an existing one, but a human being can also create artwork that is similar to existing art. Ultimately, I think the only solution to rights protection must be handling it at that same individual level.

Another element that needs to be considered is that AI-generated art will likely be entirely anonymous before long. Right now, anyone can go to [] and share the generated face to Reddit. Once that's freely available with DALL-E 2-level art and better (and I don't think that's avoidable at this point), I don't think any social norms can hinder it.

The other option to social norms is to outlaw it. I don't think that a limited regulation would be possible, so the only possibility would be a complete ban. However, I don't think all the relevant governments will have the willpower to do that. Even if the USA bans creating image-generation AIs like this (and they'd need to do so in the next year or two to stop it from already being widely spread), people in China and Russia will surely develop them within a decade.

Determining that the provenance of an artwork is a human rather than an AI seems impossible. Even if we added tracing to all digital art tools, it would still be possible to create an image with an AI, print and scan it, and then claim that

If opens are thought of as propositions, and specialization order as a kind of ("logical") time, 

Up to here made sense.

with stronger points being in the future of weaker points, then this says that propositions must be valid with respect to time (that is, we want to only allow propositions that don't get invalidated).

After here I was lost. Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

This setting moti

... (read more)
This was just defining/motivating terms (including "validity") for this context; the technical answer is to look at the definition of the specialization preorder when it's being suggestively called "logical time". If an open is a "proposition", a point being contained in an open is "the proposition is true at that point", and a point stronger in the specialization order than another point is "in the future of the other point", then in these terms we can say that "if a proposition is true at a point, it's also true at a future point", or that "propositions are valid with respect to time going forward", in the sense that their truth is preserved when moving from a point to a future point.

Logical time is intended to capture decision making, with future decisions advancing the agent's point of view in logical time. So if an agent reasons only in terms of propositions valid with respect to advancement of logical time, then any knowledge it accumulated remains valid as it makes decisions; that's some of the motivation for looking into reasoning in terms of such propositions.

This is mostly about how domain theory describes computations. The interesting thing is how the computations are not necessarily in the domains at all; they only leave observations there, and it's the observations that the opens are ostensibly talking about, yet the goal might be to understand the computations, not just the observations (in program semantics, the goal is often to understand just the observations, though, and a computation might be defined to only be its observed behavior). So one point I wanted to make is to push against the perspective where points of a space are what the logic of opens is intended to reason about, when the topology is not Frechet (has a nontrivial specialization preorder).

Yeah, I've got nothing, just a sense of direction and a lot of theory to study, or else there would've been a post, not just a comment triggered by something on a vaguely similar topic. So this thread i
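The preservation property can be checked mechanically on a small example (my own toy sketch, not from the comment): on a finite non-Frechet space, compute the specialization preorder from closures, then verify that every open ("proposition") that holds at a point also holds at every future (more specialized) point.

```python
from itertools import product

# Toy finite topological space with a nontrivial specialization preorder.
points = {"a", "b", "c"}
# Opens (closed under union and intersection): this is not a Frechet (T1) space.
opens = [set(), {"c"}, {"b", "c"}, {"a", "b", "c"}]

def closure(subset):
    """Smallest closed set (complement of an open) containing `subset`."""
    closed_sets = [points - u for u in opens]
    return set.intersection(*[c for c in closed_sets if subset <= c])

def specializes(x, y):
    """y is a specialization of x ("in x's future"): x lies in the closure of {y}."""
    return x in closure({y})

# "Propositions" (opens) are valid with respect to logical time: once true
# at a point, they stay true at every future (more specialized) point.
for u, (x, y) in product(opens, product(points, repeat=2)):
    if specializes(x, y) and x in u:
        assert y in u

print(specializes("a", "c"))  # → True: c is in a's future
```

Here "a" is a very unspecialized point (it belongs only to the whole space), so every proposition true at "a" trivially survives the move to "b" or "c"; the assertion loop checks this for all opens and all pairs of points.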

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of". 

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in ... (read more)
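The contrast can be made concrete with a minimal sketch (my own illustration, not Abram's formalism): an agent can carry expected-value estimates for its options directly, with no utility function anywhere that they are the "expectations of"; the utility-function route is just one particular gadget for producing such numbers.

```python
# 1. Utility-function route: a probability distribution over outcomes plus a
#    utility function, combined into an expectation.
def expected_value(distribution, utility):
    return sum(p * utility(outcome) for outcome, p in distribution.items())

coin_bet = {"heads": 0.5, "tails": 0.5}
utility = {"heads": 10.0, "tails": -4.0}.get

# 2. Direct route: the agent simply carries expected-value estimates for its
#    options, with no underlying utility function or distribution in sight.
direct_estimates = {"coin_bet": 3.0, "decline": 0.0}

print(expected_value(coin_bet, utility))  # → 3.0
print(direct_estimates["coin_bet"])       # → 3.0
```

Both routes yield the same action-guiding numbers; the difference is only whether a utility function appears as an intermediate representation.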

Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with "takeover without holding power over someone". Specifically this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that."

It's possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).

I think it makes sense to have a loose probabilistic relationship. I do not think it makes sense for it to be a crux, in the sense of a thing which, if false, would make John abandon his view. There are just too many weak steps. The AI industry is not the AC industry. I happen to agree with John's views about AC, but it's not obvious to me that those views imply this particular test turning out as he's predicting. (Is he averaging over the wrong points?) It's more probable than not, but my point here is that the whole thing is made of fairly weak inferences. 

To be clear, I am pro what John is doing and how he is engaging; it's more John's commentors who felt confusing to me. 
