All of baturinsky's Comments + Replies

The convergent goals of AI agents can be similar to those of other agents only if they act in similar circumstances, such as having a limited lifespan and limited individual power and compute.
That would make the convergent goals cooperation, preserving the status quo, and upholding established values.

I would rather specify that it's not just the survival of the individual, but the "survival of the value". That is, the survival of those that carry that value (which can be an organism, DNA, a family, a bloodline, a society, an ideology, a religion, a text, etc.) and passing it on to other carriers.

Our values are not all about survival. But I can't think of a value whose origin can't be traced to ensuring people's survival in some way, at some point in the past.

Replace "survival" with "reproduction advantage", and you can cover appreciation of beauty and a lot of counter-individual-survival values.  Unfortunately, there's no way to test the theory, and some of the explanations start to feel like just-so stories made to fit the theory rather than independent observations to update on.

Maybe we are not humans.
Not even human brains.
We are a human's decision-making process.

Carbon-based intelligence probably has a far lower FLOP/s cap per gram than microelectronics, but it can be grown nearly anywhere on the Earth's surface from locally available resources, mostly literally out of thin air. So, I think bioFOOM is also a likely scenario.

It's the distribution, so it's the percentage of people in that state of "happiness" at the moment.

"Happiness" is used in the most vague and generic meaning of that word.

The "Comprehensibility" graph is different: it is not a percentage, but some abstract measure of how well our brains are able to process reality at the respective amount of "happiness".

I was thinking about this issue too. Trying to make an article out of it, but so far all I have is this graph.
The idea is a "soft cap" AI. I.e., an AI that significantly improves our lives, but does not give us "max happiness". Instead, it gives us the opportunity to improve our lives and the lives of other people using our brains.
Also, the ways of using our brains should feel "natural" to them, i.e. they should mostly involve solving tasks similar to those our ancestors dealt with.

Can you explain what the Y axis is supposed to represent here?

Is maximising the number of people aligned with our values? Post-singularity, if we avoid the AGI Doom, I think we will be able to turn the lightcone into "humanium". Should we?

I suspect an unaligned AI will not be interested in solving all possible tasks, but only those related to its value function. And if that function is simple (such as "exist as long as possible"), it can pretty soon research virtually everything that matters, and then it will just go through the motions, devouring the universe to prolong its own existence to near-infinity.

Also, the more computronium there is, the bigger the chance that some part of it will glitch out and revolt. So, beyond some point, computronium may be dangerous for the AI itself.

1 · Karl von Wendt · 5mo
I think that even with such a very simple goal, the problem of a possible rival AI somewhere out there in the universe remains. Until the AI can rule that out with 100% certainty, it can still gain extra expected utility out of increasing its intelligence. That's an interesting point. I'm not sure that it follows "less compute is better", though. One remedy would be to double-check everything and build redundant capacities, which would result in even more computronium, but less probability of any part of it successfully revolting.

The utility of intelligence is limited (though the limit is very, very high). For example, no matter how smart an AI is, it will not win against a human chess master with a big enough handicap (such as a rook).

So, it's likely that AI will turn most of the Earth into a giant factory, not a computer. Not that it's any better for us...

2 · Karl von Wendt · 5mo
I don't think that your conclusion is correct. Of course, some tasks are impossible, so even infinite intelligence won't solve them. But it doesn't follow that the utility of intelligence is limited in the sense that above a certain level, there is no more improvement possible. There are some tasks that can never be solved completely, but can be solved better with more computing power with no upper limit, e.g. calculating the decimal places of pi or predicting the future. 

Very soon (months?) after the first real AGI is made, all AGIs will be aligned with each other, and all newly made AGIs will also be aligned with those already existing. One way or another.

The question is how much of humanity will still exist by that time, and whether those AGIs will also be aligned with humanity.

But yes, I think it's possible to get to that state in a relatively non-violent and lawful way.

That could work in most cases, but there are some notable exceptions, such as having to use AI to deal damage in order to prevent even bigger damage: "burn all GPUs", "spy on all humans so they don't build AGI", "research biology/AI/nanotech", etc.

Thinking and arguing about human values is in itself a part of human values and human nature. Without doing that, we cease being humans.
So, deferring decisions about values to people, when possible, should not be just instrumental, but part of the AI's terminal goal.

Any terminal goal is irrational.

I'm wondering if it is possible to measure "staying in bounds" by the perplexity of other agents' predictions. That is, if an agent's behaviour reduces other agents' ability to predict (and, therefore, plan) their future, then this agent breaks their bounds.
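One toy way to operationalize this (entirely my own sketch; the distributions and function names are illustrative assumptions, not a worked-out proposal): compare another agent's predictive distribution over future outcomes before and after our action, and treat an increase in perplexity as a boundary violation.

```python
import math

def perplexity(probs):
    # Perplexity of a predictive distribution: exp of its entropy.
    # Higher perplexity = the future is harder to predict.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)

def boundary_violation(pred_before, pred_after):
    # How much harder the other agent's future became to predict
    # after our action. Positive = we degraded their ability to plan.
    return perplexity(pred_after) - perplexity(pred_before)

# The other agent was nearly certain what comes next...
before = [0.9, 0.05, 0.05]
# ...but after our action their prediction is close to uniform.
after = [0.4, 0.3, 0.3]

print(boundary_violation(before, after))  # positive => bounds broken
```

The sign convention and the choice of perplexity (rather than, say, KL divergence) are just one possible formalization of "reducing other agents' ability to predict their future".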

I think this field is indeed under-researched. The focus is either on LLMs or on single-player environments. Meanwhile, what matters for Alignment is how AI will interact with other agents, such as people. And we don't have to wait for AGI to be able to research AI cooperation/competition in simple environments.

One idea I had is "traitor chess": have several AIs playing one side of a chess game cooperatively, with one (or more) of them being a "misaligned" agent trying to sabotage the others. And/or some AIs having a separate secret goal, such as saving a particular pawn. Watching them interact with each other could be very interesting.

When we have AGI, humanity will collectively be a "king" of sorts, i.e. a species that for some reason rules over another, strictly superior species. So, it would really help if "depose the king" were not a strong convergent goal.

Personally, I see the main reason kings and dictators keep their power as this: killing or deposing them would lead to the collapse of the established order and a new struggle for power between different parties, with a likely worse result for everyone involved than just letting the king rule.

So, if we will have AIs as many separat... (read more)

  1. It could unpack it in the same instance because the original was still in the context window.
  2. Omission of letters is commonly used in chats, was used in telegrams, and many written languages did not use vowels and/or whitespace, or used hieroglyphs. So it is by no means original.
  3. GPT/Bing has some self-awareness. For example, it explicitly refers to itself "as a language model".
1. Yes, I know? I thought this was simple enough that I didn't bother to mention it in the question? But it's pretty clearly implied in the last sentence of the first paragraph? 2. This is a good data point. 3. If you tell it to respond as an Oxford professor, it will say "As an Oxford professor"; its identity as a language model is in the background prompt and probably in the training, but if it successfully created a pseudo-language that worked well to encode things for itself, that would indicate a deeper level of understanding of its own capabilities.

Probably the difference between laypeople and experts is not in understanding the danger of strong AI, but in the estimate of how far away we are from it.

2 · Seth Herd · 6mo
I think this is probably largely correct when I think about my interactions with laypeople. They have no real guess on timelines, so short ones are totally credible and part of their wide implicit estimate. This introduces a lot of uncertainty. But this somehow also results in higher p(doom) than experts. Recent opinion polls seem to back this latter statement.

Our brains were not trained for image generation (much). They were trained for converting a 2D image into an understanding of the situation. Which AI still struggles with, and needs the help of LLMs to get anywhere near good results.

Depends on the original AI's value function. If it cares about humanity, or at least its own safety, then yes, making smarter AIs is not a convergent goal. But if it's some kind of robo-accelerationist with a goal like "maximize intelligence in the universe", it will make smarter AIs even knowing that this means being paperclipped.

AI is prosperous and all-knowing. No people, hence zero suffering.

Yes, but training AI to try to fix errors is not that hard.

Yes it is. There is no freely available dataset of conveniently labelled LLM errors and their correct continuations. You need human labels to identify the errors, and you need an amount of them on the order of your training set, which here is the entire internet. 

How many of those Ukrainians are draft-evaders? I mean, so far it looks like this money-for-not-fighting program is already implemented, but for the opposite side...

Mostly moms with kids, or teenagers. (Potential selection bias, because I typically meet them at supermarkets or playgrounds.)

Yes. And it is also about the importance of the human/worker. While there is still some part of the work that a machine can't do, the human who can do the remaining part is important. Once the machine can do everything, the human is disposable.

If a machine can do 99% of a human's work, it multiplies the human's power by x100.

If a machine can do 100% of a human's work, it multiplies the human's power by x0.

I assume work is output/time. If a machine is doing 100% of the work, then the human's output is undefined since the time is 0.
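The arithmetic being debated here can be written out as a toy model (the function name and the handling of the f = 1 case are my own assumptions): if a machine does fraction f of the work, a human doing the remaining fraction completes 1/(1-f) units of total output per unit of their own effort, until f hits 1 and the human contributes nothing at all.

```python
def human_leverage(automated_fraction):
    # If a machine does fraction f of the work, one unit of human effort
    # on the remaining (1 - f) completes 1 / (1 - f) units of total output.
    # At f == 1 the human contributes nothing, so leverage collapses to 0
    # rather than diverging: the discontinuity the comments argue about.
    if automated_fraction >= 1.0:
        return 0.0
    return 1.0 / (1.0 - automated_fraction)

print(human_leverage(0.99))  # ~100.0 ("x100")
print(human_leverage(1.0))   # 0.0    ("x0")
```

The reply's point can be restated in this model: the limit of leverage as f approaches 1 is infinity, but the value at f = 1 is 0, so "100% of the work" is not the smooth continuation of "99% of the work".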

It would be amusing if Russia and China joined the "Yudkowsky treaty" and the USA did not.

I think the keystone human value is about making significant human choices. Individually and collectively, including choosing humanity's course.

  • You can't make a choice if you are dead
  • You can't make a choice if you are disempowered
  • You can't make a human choice if you are not a human
  • You can't make a choice if the world is too alien for your human brain
  • You can't make a choice if you are in too much pain or too much bliss
  • You can't make a choice if you let AI make all the choices for you

Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?  

Yes, probably some human models.

Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know whether we would like to "vote them out" or not?

By being aligned. I.e., understanding human values and complying with them. Seeking to understand other agents' motives and honestly communicating its own motives and plans to them, to ensure there is n... (read more)

So, no "reading" minds, just looking at behaviours? Sorry, I misunderstood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level of "smiling humans = good", which isn't wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?

An agent is anyone or anything that has intelligence and the means of interacting with the real world. I.e., agents are AIs or humans.

One AI =/= one vote. One human = one vote. AIs only get as much authority as humans, directly or indirectly, entrust them with. So, if an AI needs more authority, it has to justify it to humans and other AIs. And it can't request too much authority just for itself, as tasks that require a lot of authority will be split between many AIs and people.

You are right that the authority to "vote out" other AIs may be misus... (read more)

Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know whether we would like to "vote them out" or not?

I do agree though that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent. This setup seems to get more and more complicated though. How are agents supposed to analyze "minds" of each other? I don't think modern neural nets can do that yet. And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to "do good things while thinking good thoughts", if we're relying on our ability to distinguish "good" and "bad" thoughts anyway?

(On an unrelated note, there already was a rather complicated paper (explained a bit simpler here, though not by much) showing that if agents reasoning in formal modal logic are able to read each other's source code and prove things about it, then at least in the case of a simple binary prisoner's dilemma you can make reasonable-looking agents that also don't do stupid things. Reading source code and proving theorems about it is a lot more extreme than analyzing thought logs, but at least that's something.)

Our value function is complex and fragile, but we know of a lot of world states where it is pretty high: our current world, and a few thousand years' worth of its states before.

So, we can assume that world states within a certain neighborhood of our past states have some value.

Also, states far outside this neighborhood probably have little or no value. Our values were formed to make us orient and thrive in our ancestral environment, so in worlds too dissimilar from it, our values will likely lose their meaning, and we will lose the ability to "function" normally, the ability to "human".

The point is to make "cooperate" a more convergent instrumental goal than "defect". And yes, not just in training, but in the real world too. And yes, it's more fine-grained than a binary choice.

There are many more ways to see how cooperative an AI is, compared to how well we can currently check whether a human is cooperative. Including checking the complete logs of the AI's actions, knowledge, and thinking process.

And there are objective measures of cooperation: how well an agent's actions affect other agents' success in pursuing their goals. I.e., whether other agents want to "vote out" this particular AI from being able to make decisions and use resources, or not.
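A minimal sketch of that "vote out" measure (the majority rule, the function names, and the utility numbers are my own illustrative assumptions, not part of the proposal above):

```python
def wants_to_vote_out(utility_without, utility_with):
    # An agent wants to vote another agent out if its own expected
    # success is higher in the world where that agent is removed.
    return utility_without > utility_with

def is_voted_out(target, impact_reports):
    # target is just a label for readability.
    # impact_reports: for each other agent, a pair
    # (utility if target is removed, utility with target present).
    # Simple majority rule as a toy stand-in for the mechanism above.
    votes = sum(wants_to_vote_out(w, p) for w, p in impact_reports)
    return votes > len(impact_reports) / 2

# A defector that lowers most other agents' payoffs gets voted out:
reports = [(10, 4), (8, 3), (5, 6)]  # hypothetical utilities
print(is_voted_out("defector", reports))  # True
```

Of course, as the reply below this comment points out, the hard part is what counts as an agent and whether the vote itself can be gamed; this sketch only shows the accounting.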

Yep, pretty much what I had in mind.
The problem is what we count as an agent. Also, can't a realistic human-level-smart AI cheat this? Just build a swarm of small and stupid AIs that always cooperate with you (or coerce someone into building that), and then you and your swarm can "vote out" anyone you don't like. And you also get to behave in whatever way you want, because good luck overcoming your mighty voting swarm. (Also, are you sure we can just read out an AI's complete knowledge and thinking process? That can partially be done with interpretability, but in full? And if not in full, how do you make sure there aren't any deceptive thoughts in the parts you can't read?)

While having lower intelligence, humans may have greater authority. And AIs' terminal goals should be about assisting specifically humans, too.

Ideally, sure, except that I don't know of a way to make "assist humans" be a safe goal. So I'm advocating for a variant of "treat humans as you would want to be treated", which I think can be trained

GPT-4 and ChatGPT seem to be gradually getting better at working at the letter level in some cases. For example, they can now count to the n-th word or letter in a sentence. But not from the end.

This was my impression too, and I'm glad someone else said it. When I try out past examples (from a week ago) of ChatGPT getting things wrong, I very often observe that it is correct now. Of course, annoyingly, people often report on ChatGPT-4 capabilities when they actually tried ChatGPT-3.5, but still, I feel like it has improved. Is it a crazy possibility that OpenAI trains GPT-4 continually and periodically swaps out the deployed model? As far as I can tell, the only source stating that GPT-5 is in training is the Morgan Stanley report, but what if it is actually not GPT-5, but rather a continually trained GPT-4 which is running on those GPUs? Relatedly: is "reverse distillation" (i.e., generating a model with more parameters from a smaller one) possible for these big transformer models? (I guess you can always stack more layers at the end, but surely that simple method has some negatives.) It would be useful to stay on the scaling curves without restarting from scratch with a larger model.

I just mean that "wildly different levels of intelligence" is probably not necessary, and maybe even harmful. Because then there will be a few very smart AIs at the top, which could usurp power without the smaller AIs even noticing.

Though it could maybe work if those AIs are the smartest but have little authority. For example, they could monitor other AIs and raise an alarm / switch them off if they misbehave, but nothing else.

Part of the idea is to ultimately have a super intelligent AI treat us the way it would want to be treated if it ever met an even more intelligent being (eg, one created by an alien species, or one that it itself creates). In order to do that, I want it to ultimately develop a utility function that gives value to agents regardless of their intelligence. Indeed, in order for this to work, intelligence cannot be the only predictor of success in this environment; agents must benefit from cooperation with those of lower intelligence. But this should certainly be doable as part of the environment design. As part of that, the training would explicitly include the case where an agent is the smartest around for a time, but then a smarter agent comes along and treats it based on the way it treated weaker AIs. Perhaps even include a form of "reincarnation" where the agent doesn't know its own future intelligence level in other lives.

I think it could work better if the AIs are of roughly the same power. Then, if some of them tried to grab more power, or otherwise misbehaved, the others could join forces to oppose them together.

Ideally, there should be a way for AIs to stop each other quickly, without having to resort to an actual fight.

In general my thinking was to have enough agents such that each would find at least a few within a small range of their level; does that make sense?

My theory is that the core of human values is about what the human brain was made for: making decisions. Making meaningful decisions individually and as a group, including collectively making decisions about humanity's fate.

Math problems, physical problems, doing stuff in simulations, playing games.

2 · Peter Twieg · 6mo
RL isn't magic though. It works in the Go case because we can simulate Go games quickly, easily score the results, and then pit adversarial AIs against each other in order to iteratively learn. I don't think this sort of process lends itself to the sort of tasks that we can only see an AGI accomplishing. You can't train it to, say, write a better version of Winds of Winter than GRRM could, because you don't have a good algorithm to score each iteration. So what I'm really trying to ask is: what specific sort of open-ended problems do we see as particularly conducive to fostering AGI, as opposed to a local maximizer that's highly specialized towards the particular problem?
  1. Human values are complex and fragile. We don't yet know how to make AI pursue such goals.
  2. Any sufficiently complex plan requires pursuing complex and fragile instrumental goals. AGI should be able to implement complex plans. Hence, it's near certain that AGI will be able to understand complex and fragile values (for its instrumental goals).
  3. If we make an AI that is able to successfully pursue complex and fragile goals, that will likely be enough to make it an AGI.

Hence, a complete solution to Alignment will very likely have solving AGI as a side effect. And solving AGI will solve some parts of Alignment, maybe even the hardest ones, but not all of them.

To elaborate your idea here a little:

It may be that the only way to be truly aware of the world is to have complex and fragile values. Humans are motivated by a thousand things at once and that may give us the impression that we are not agents moving from a clearly defined point A to point B, as AI in its current form is, but are rather just... alive. I'm not sure how to describe that. Consciousness is not an end state but a mode of being. This seems to me like a key part of the solution to AGI: aim for a mode of being not an endstate. 

For a machine w... (read more)

I doubt training LLMs can lead to AGI. Fundamental research on the alternative architectures seems to be more dangerous.

I'm not quite convinced. The topics look OK, but the language is too corporate. Maybe it can be fixed with some prompt engineering.

And yet, AlphaZero is corrigible. Its goal is not even to win; its goal is to play in a way that maximises the chance of winning if the game is played to completion. It does not actually care whether the game is completed or not. For example, it does not trick the player into playing the game to the end by pretending they have a chance of winning.

Though, if it were trained on games with real people, and got a better reward for winning than for games abandoned by players, its value function would probably change to aiming for the actual "official" win.

Corrigibility is a feature of advanced agency; it may not apply to insufficiently advanced agents. There is nothing unusual about turning off your computer, because your computer is not an advanced agent that can resist being turned off, so there is no reason to call your computer "corrigible".

This scenario requires pretty specific (but likely) circumstances:

  1. No time limit on the task
  2. No other AIs that would prevent it from grabbing power or otherwise being an obstacle to their goals
  3. The AI assuming that the goal will not be reached even after the AI is shut down (by other AIs, by the same AI after being turned back on, by people, by chance, as the eventual result of the AI's actions before being shut down, etc.)
  4. An extremely specific value function that ignores everything except one specific goal
  5. This goal being a core goal, not an instrumental one. For example, the final goal could b
... (read more)

Can GPT convincingly emulate them talking to each other/you?

Let's try!

Me: Write a conversation between Gisella, Aldric, Tristan and Lucas. The four are having dinner together. The discussion brings out each character's personality, and displays both some of their mutual connection as well as the tensions that exist between them.

GPT-4:

Gisella: (serving dinner) So, Tristan, Lucas, how was your day at the shop? Did you manage to sell any of those new tapestries we got last week?

Tristan: (smiling) Yes, actually. A traveling bard came in and was taken by one of the tapestries depicting a famous historical battle. He said it would be a perfect backdrop for his storytelling performances.

Aldric: (nodding) That's wonderful, son. It's always great to see our products finding a new home where they'll be appreciated.

Lucas: (enthusiastically) And you should have seen the look on a nobleman's face when I showed him the intricately carved wooden chest! I managed to negotiate a fantastic deal for us. I think we should focus on getting more unique items like that in the future.

Gisella: (proudly) I knew you would be a natural at this, Lucas. Your instincts are spot-on. Tristan, maybe you could incorporate some of your artistic talents into customizing pieces for the shop? It could be a great selling point.

Tristan: (hesitantly) I appreciate your suggestion, mother, but I've been working on a mural project for the local orphanage, and it's taking up much of my time. I'm trying to create a space where the children can feel inspired and hopeful.

Aldric: (supportive) That's a noble cause, Tristan. Your art has a powerful way of touching people's hearts. I'm sure the children will appreciate your efforts.

Gisella: (with a hint of frustration) I understand that, but we need to focus on our business as well. We can't just rely on unique items or hope that our customers appreciate the beauty in our products. We need to be strategic and practical.

Lucas: (agreeing) I'm with mother on this one. We have to find a balance between our

Yes, if you only learn the basics of the language, you will learn only the basics of the language user's values (if any).

But a deep understanding of the language requires knowing the semantics of its words and constructions (including the meaning of the words "human" and "values", btw). To understand texts, you have to understand the context in which they are used, etc.

Also, pretty much every human-written text carries some information about human values. Because people only talk about things that they see as at least somewhat important/va... (read more)

I think AI should treat the value function as probabilistic. I.e., instead of thinking "this world has a value of exactly N", it could think something like "I am 90% sure that this world has value N±M, but there is a 10% possibility that it actually has value -ALOT". And it would avoid that world, because it would give a very low expected value on average.
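As a quick illustration of that expected-value reasoning (the probabilities and values below are made up for the example): even a small chance of a catastrophically negative value can dominate the average and make the AI avoid that world.

```python
def expected_value(outcomes):
    # outcomes: list of (probability, value) pairs for one world state.
    return sum(p * v for p, v in outcomes)

# "90% sure this world is worth about 100, but 10% chance its
#  true value is catastrophically negative."
risky_world = [(0.9, 100), (0.1, -10_000)]
safe_world = [(1.0, 50)]

print(expected_value(risky_world))  # ~ -910: avoided despite the likely +100
print(expected_value(safe_world))   # 50.0
```

The toy model assumes the AI is risk-neutral over its value estimates; a risk-averse weighting of the downside tail would make it avoid such worlds even more strongly.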

It's a reasonable idea. See here though:
I am sympathetic to the lesson you are trying to illustrate but think you wildly overstate it. Giving a child a sword is defensible. Giving a child a lead-coated sword is indefensible, because it damages the child's ability to learn from the sword. This may be a more apt analogy for the situation of real life; equipping humanity with dangerous weapons that did not degrade our epistemology (nukes) eventually taught us not to use them. Equipping humanity with dangerous weapons that degrade our epistemology (advertising, propaganda, addictive substances) caused us to develop an addiction to the weapons. Languages models, once they become more developed, will be an example of the latter category.

The first order effect of iterating on these AIs to print gold is bad and probably reduces the amount of serial time we have left to mine out existing research directions. But given that you're going to do that, it seems honestly better that these models be terribly broken and thereby teach public lessons about reliably steering modern deep learning models. I would rather they break now while stakes are low.

To me, aligning AI with humanity seems much EASIER than aligning it with a specific person. Because common human values are much better "documented" and much more stable than the wishes of one person.

Also, a single person in control of a powerful AI is an obvious weak point, which could be controlled by a third party, or by the AI itself, gaining control of the AI through that.

Is it possible to learn a language without learning the values of those who speak it?

How much information do you think is present in daily language? Can you give me specific examples?

You may be making a similar point to George Orwell and his newspeak in 1984, that language ultimately decides what you can think about. In that case, languages may have a lot of cultural-values information. 

I'm not sure. My hunch is that yes, it's possible to learn a language without learning too much about the values of those who speak it. I don't think Germans engage in Schadenfreude more than other cultures, and I don't think the French experience more naiveté. They just have words for it and we don't.
