All of Jan_Kulveit's Comments + Replies

It's a much more natural way to think about it (cf. e.g. E. T. Jaynes, Probability Theory, examples in Chapter IV)

In this specific case of evaluating hypotheses, the distance in log-odds space indicates the strength of the evidence you would need to see to update. A close distance implies you don't need that much evidence to update between the positions (note the distance between 0.7 and 0.2 is smaller than between 0.9 and 0.99). If you need only a small amount of evidence to update, it is easy to imagine some other observer as reasonable as you who had accumulated a bit or two so... (read more)

As a minor nitpick, 70% likely and 20% likely are quite close in log-odds space, so it seems odd that you think what you believe is reasonable while something so close is "very unreasonable".
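A quick sketch of the distances in question (just the standard logit transform in bits; the specific numbers are mine, not from the comment):

```python
import math

def logodds(p):
    """Log-odds (logit) of a probability, measured in bits."""
    return math.log2(p / (1 - p))

# Distance in log-odds space ~ bits of evidence needed to move
# from one credence to the other.
d_70_20 = abs(logodds(0.7) - logodds(0.2))   # ~3.2 bits
d_90_99 = abs(logodds(0.9) - logodds(0.99))  # ~3.5 bits

print(d_70_20, d_90_99)
```

So 0.7 vs 0.2 really is a smaller gap than 0.9 vs 0.99, even though the latter pair looks "closer" on the probability scale.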

7Daniel Kokotajlo3d
I agree that logodds space is the right way to think about how close probabilities are. However, my epistemic situation right now is basically this: "It sure seems like Doom is more likely than Safety, for a bunch of reasons. However, I feel sufficiently uncertain about stuff, and humble, that I don't want to say e.g. 99% chance of doom, or even 90%. I can in fact imagine things being OK, in a couple different ways, even if those ways seem unlikely to me. ... OK, now if I imagine someone having the flipped perspective, and thinking that things being OK is more likely than doom, but being humble and thinking that they should assign at least 10% credence (but less than 20%) to doom... I'd be like "what are you smoking? What world are you living in, where it seems like things will be fine by default but there are a few unlikely ways things could go badly, instead of a world where it seems like things will go badly by default but there are a few unlikely ways things could go well? I mean I can see how you'd think this if you weren't aware of how short timelines to ASI are, or if you hadn't thought much about the alignment problem..." If you think this is unreasonable, I'd be interested to hear it!
This seems to violate common sense. Why would you think about this in log space? 99% and 1% are identical in if(>0) space, but they have massively different implications for how you think about a risk (just like 20% and 70% do!)

Judging in an informal and biased way, I think some impact is in the public debate being marginally a bit more sane - but this is obviously hard to evaluate. 

To what extent more informed public debate can lead to better policy remains to be seen; also, unfortunately, I would tend to glomarize over discussing the topic directly with policymakers.

There are some more proximate impacts, like us (ACS) getting a steady stream of requests for collaboration or people wanting to work with us, but we basically don't have capacity to form more collaborations, and don't have capacity to absorb more people unless they are exceptionally self-guided.

It is testable in this way for OpenAI, but I can't skip the tokenizer and embeddings and just feed vectors to GPT-3. Someone can try that with ' petertodd' and GPT-J. Or, you can simulate something like anomalous tokens by feeding such vectors to some of the LLaMA models (maybe I'll do that, I just don't have the time now).

I did some experiments with trying to prompt "word component decomposition/expansion". They don't prove anything and can't be too fine-grained, but the projections shown intuitively make sense.

davinci-instruct-beta, T=0:

Add more examp... (read more)

GPT-J doesn't seem to have the same kinds of ' petertodd' associations as GPT-3. I've looked at the closest token embeddings and they're all pretty innocuous (but the closest to the ' Leilan' token, removing a bunch of glitch tokens that are closest to everything is ' Metatron', who Leilan is allied with in some Puzzle & Dragons fan fiction). It's really frustrating that OpenAI won't make the GPT-3 embeddings data available, as we'd be able to make a lot more progress in understanding what's going on here if they did.

I don't know / talked with a few people before posting, and it seems opinions differ.

We also talk about e.g. "the drought problem" where we don't aim to get the landscape dry.

Also, as Kaj wrote, the problem isn't how to get self-unaligned

Some speculative hypotheses, one more likely and mundane, one more scary, one removed

1. Nature of embeddings

Do you remember word2vec (Mikolov et al) embeddings? 

Stuff like (woman-man)+king = queen works in the embedding vector space.

However, the vector (woman-man) itself does not correspond to a word; it's more something like "the contextless essence of femininity". Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion about how the results sometimes highlight implicit sexism in the language corpus.)
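A minimal sketch of the analogy arithmetic, using made-up toy vectors (real word2vec embeddings are learned, ~300-dimensional, and come from a corpus; these numbers are purely illustrative):

```python
# Toy 4-d "embeddings" (hypothetical values, for illustration only).
emb = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.1, 0.9],
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(a * a for a in v) ** 0.5
    return dot / (nu * nv)

# (woman - man) + king: the "femininity direction" applied to "king".
target = add(sub(emb["woman"], emb["man"]), emb["king"])

# The nearest remaining word (excluding the inputs) should be "queen".
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # queen
```

The point carries over: (woman - man) on its own is not near any word's vector, but adding it to a word vector shifts that word in a consistent direction.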

Note such vecto... (read more)

Hypothesis I is testable! Instead of prompting with a string of actual tokens, use a “virtual token” (a vector v from the token embedding space) in place of ‘ petertodd’.

It would be enlightening to rerun the above experiments with different choices of v:

  • A random vector (say, iid Gaussian )
  • A random sparse vector
  • (apple+banana)/2
  • (villain-hero)+0.1*(bitcoin dev)
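For illustration, here is how those candidate vectors might be constructed. The embedding rows here are random stand-ins (purely hypothetical); in an actual experiment you would take them from the model's embedding matrix and feed v to the model in place of the ' petertodd' embedding (e.g. via `inputs_embeds` in HuggingFace transformers):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # GPT-J's embedding dimension

# Stand-ins for real token-embedding rows (hypothetical; in practice,
# look these up in the model's embedding matrix).
apple, banana = rng.normal(size=d), rng.normal(size=d)
villain, hero, bitcoin_dev = (rng.normal(size=d) for _ in range(3))

candidates = {
    "random_gaussian":   rng.normal(size=d),
    "random_sparse":     rng.normal(size=d) * (rng.random(d) < 0.01),
    "apple_banana_mean": (apple + banana) / 2,
    "villain_vector":    (villain - hero) + 0.1 * bitcoin_dev,
}

for name, v in candidates.items():
    print(name, v.shape)  # each is a (4096,) "virtual token"
```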


Thanks for the links!

What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more of 'if you don't understand what you set up, you will end up in a bad place'.

I think an example of a dynamic which we sort of understand and expect to be reasonable by human standards is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to e.g. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models and let them deliberate for thousands of human-speech-equivalent years and decide about the training of the next-in-the-sequence model, I don't trust the process.

4Charlie Steiner1mo
Fair enough.

I don't think the self-alignment problem depends on the notion of 'human values'. Also I don't think "do what I said" solves it. "Do what I said" is roughly "aligning with the output of the aggregation procedure", and

  • for most non-trivial requests, understanding what I said depends on a fairly complex model of what the words I said mean
  • often there will be tension between your words; strictly interpreted, "do not do damage" can mean "do nothing" - basically anything has some risk of some damage; when you tell an LLM to be "harmless" and "helpful", these requests point in different directions
  • strong learners will learn what led you to say the words anyway
I see the connection between self-alignment and human values as follows: the idea of human values assumes that a human has a stable set of preferences. The stability is an important part of the idea of human values. But the human motivation system is notoriously non-stable: I want to drink, I have a drink, and now I don't want to drink. The idea of "desires" may be a better fit than "human values", as it is normal for desires to evolve and contradict each other. But the human motivational system is more complex than that: I have rules and I have desires, which often contradict each other and are in dynamic balance. For example, I have a rule not to drink alcohol and a desire for a drink. Speaking about your bullet points: everything depends on the situation, and there are two main types of situations: a) researchers start the first-ever AI for the first time, b) a consumer uses a home robot for a task. In the second case, the robot is likely trained on a very large dataset and knows what the good and bad outcomes are for almost all possible situations.

Note that this isn't exactly the hypothesis proposed in the OP and would point in a different direction.

OP states there is a categorical difference between animals and humans, in the ability of humans to transfer data to future generations. This is not the case, because animals do this as well.

What your paraphrase of Secrets of Our Success suggests is that this existing capacity for transfer of data across generations is present in many animals, but there is some threshold of 'social learning' which was crossed by humans - and when crossed, led to cultural... (read more)

  There doesn't need to be a categorical difference, just a real difference that is strong enough to explain humanity's sharp left turn by something other than increased brain size. I do believe that's plausible - humans are much, much better than other animals at communicating abstractions and ideas across generations. Can't speak about the book, but X4vier's example would seem to support that argument.

Thanks for the comment.

I do research on empirical agency and it still surprises me how little the AI-safety community touches on this central part of agency - namely, that you can't have agents without this closed loop.

In my view it's one of the results of the AI safety community being small and sort of bad at absorbing knowledge from elsewhere - my guess is this is in part a quirk due to founder effects, and also downstream of the incentive structure on platforms like LessWrong.

But please do share this stuff.

I've been speculating a bit (mostly to myself)

... (read more)

This whole argument just does not hold.

(in animals)

The only way to transmit information from one generation to the next is through evolution changing genomic traits, because death wipes out the within lifetime learning of each generation.

This is clearly false. GPT4, can you explain? :

While genes play a significant role in transmitting information from one generation to the next, there are other ways in which animals can pass on information to their offspring. Some of these ways include:

  1. Epigenetics: Epigenetic modifications involve changes in gene expression that do
... (read more)
4Quintin Pope1mo
I don't think this objection matters for the argument I'm making. All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information. Thus, the enormous disparity between the brain's within-lifetime learning versus evolution cannot lead to a multiple-OOM faster accumulation of capabilities as compared to evolution. When non-genetic cross-generational channels are at saturation, the plot of capabilities-related info versus generation count looks like this [plot omitted], with non-genetic information channels only giving the "All info" line a ~constant advantage over "Genetic info". Non-genetic channels might be faster than evolution, but because they're saturated, they only give each generation a fixed advantage over where they'd be with only genetic info. In contrast, once the cultural channel allows for an ever-increasing volume of transmitted information, then the vastly faster rate of within-lifetime learning can start contributing to the slope of the "All info" line, and not just its height. Thus, humanity's sharp left turn.
It seems to me that the key threshold has to do with the net impact of meme replication:
  • Transmitting a meme imposes some constraint on the behaviour of the transmitting system.
  • Transmitting a meme sometimes benefits the system (or its descendants).
Where the constraint is very limiting, all but a small proportion of memes will be selected against. The [hunting technique of lions] meme is transferred between lions, because being constrained to hunt is not costly, while having offspring observe hunting technique is beneficial. This is still memetic transfer - just a rather uninteresting version. Humans get to transmit a much wider variety of memes more broadly because the behavioural constraint isn't so limiting (speaking, acting, writing...), so the upside needn't meet a high bar. The mechanism that led to hitting this threshold in the first place isn't clear to me. The runaway behaviour after the threshold is hit seems unsurprising. Still, I think [transmission became much cheaper] is more significant than [transmission became more beneficial].

I think OP is correct about cultural learning being the most important factor in explaining the large difference in intelligence between homo sapiens and other animals.

In early chapters of Secrets of Our Success, the book examines studies comparing the performance of young humans and young chimps on various cognitive tasks. The book argues that across a broad array of cognitive tests, 4-year-old humans do not perform significantly better than 4-year-old chimps on average, except in cases where the task can be solved by imitating others (human children crushe... (read more)

Mostly yes, although there are some differences.

1. humans also understand they constantly modify their model - by perceiving and learning - we just usually don't use the word 'changed myself' in this way
2. yes, the difference in human condition is from shortly after birth we see how our actions change our sensory inputs - ie if I understand correctly we learn even stuff like how our limbs work in this way. LLMs are in a very different situation - like, if you watched thousands of hours of video feeds about e.g. a grouphouse, learning a lot about how the i... (read more)

This seems the same confusion again.

Upon opening your eyes, your visual cortex is asked to solve a concrete problem no brain is capable of solving or expected to solve perfectly: predict sensory inputs. When the patterns of firing don't predict the photoreceptor activations, your brain gets modified into something else, which may do better next time. Every time your brain fails to predict its visual field, there is a bit of modification, based on computing what's locally a good update.

There is no fundamental difference in the nature of the task. 

Where the... (read more)

2Max H2mo
No, but I was able to predict my own sensory input pretty well, for those 5 minutes. (I was sitting in a quiet room, mostly pondering how I would respond to this comment, rather than the actual problem you posed. When I closed my eyes, the sensory prediction problem got even easier.) You could probably also train a GPT on sensory inputs (suitably encoded) instead of text, and get pretty good predictions about future sensory inputs. Stepping back, the fact that you can draw a high-level analogy between neuroplasticity in human brains <=> SGD in transformer networks, and sensory input prediction <=> next token prediction doesn't mean you can declare there is "no fundamental difference" in the nature of these things, even if you are careful to avoid the type error in your last example. In the limit (maybe) a sufficiently good predictor could perfectly predict both sensory input and tokens, but the point is that the analogy breaks down in the ordinary, limited case, on the kinds of concrete tasks that GPTs and humans are being asked to solve today. There are plenty of text manipulation and summarization problems that GPT-4 is already superhuman at, and SGD can already re-weight a transformer network much more than neuroplasticity can reshape a human brain.

I don't see how the comparison of hardness of 'GPT task' and 'being an actual human' should technically work - to me it mostly seems like a type error. 

- The task 'predict the activation of photoreceptors in the human retina' clearly has the same difficulty as 'predict the next word on the internet' in the limit. (cf. Why Simulator AIs want to be Active Inference AIs)

- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less diff... (read more)

I'd really like to see Eliezer engage with this comment, because to me it looks like the following sentence's well-foundedness is rightly being questioned. While I generally agree that powerful optimizers are dangerous, the fact that the GPT task and the "being an actual human" task are somewhat different has nothing to do with it.

While the claim - the task ‘predict next token on the internet’ absolutely does not imply learning it caps at human-level intelligence - is true, some parts of the post and reasoning leading to the claims at the end of the post are confused or wrong. 

Let’s start from the end and try to figure out what goes wrong.

GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.

And since the task that GPTs are being trained on is different from a

... (read more)
3Max H2mo
  Yes, human brains can be regarded as trying to solve the problem of minimizing prediction error given their own sensory inputs, but no one is trying to push up the capabilities of an individual human brain as fast as possible to make it better at actually doing so. Lots of people are definitely trying this for GPTs, measuring their progress on harder and harder tasks as they do so, some of which humans already cannot do on their own.  Or, another way of putting it: during training, a GPT is asked to solve a concrete problem no human is capable of or expected to solve. When GPT fails to make an accurate prediction, it gets modified into something that might do better next time. No one performs brain surgery on a human any time they make a prediction error.
9Eliezer Yudkowsky2mo
I didn't say that GPT's task is harder than any possible perspective on a form of work you could regard a human brain as trying to do; I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.

I don't mind that the post was posted without much editing or work put into formatting, but I find it somewhat unfortunate that the post was probably written without any work put into figuring out what other people have written about the topic and what terminology they use.

Recommended reading: 
- Daniel Dennett's Intentional stance
- Grokking the intentional stance
- Agents and device review

3the gears to ascension2mo
@Audere [] Thoughts on changing words to match previous ones?

This is great & I strongly endorse the program 'let's figure out what's the actual computational anatomy of human values'. (I wrote a post about it a few years ago - it wasn't that good a fit for the sociology of opinions on LessWrong then.)

Some specific points where I do disagree

1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionary desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionary older ... (read more)

Yes, I think drives like this are important on two levels. At the first level, we experience them as primary rewards -- i.e. social status gives direct dopamine hits. Secondly, they shape the memetic selection environment which creates and evolves linguistic memes of values. However, it's important to note that almost all of these drives, such as for social status, are mediated through linguistic cortical abstractions. I.e. people will try to get social status by fulfilling whatever the values of their environment are, which can lead to very different behaviours being shown and rewarded in different environments, even though powered by the same basic drive. The world model is learnt mostly by unsupervised predictive learning and so is somewhat orthogonal to the specific goal. Of course, in practice in a continual learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process. This is definitely true for humans, but it is unclear that this is necessarily bad. This is at least somewhat aligned, and this is how any kind of intrinsic motivation to external goals has to work -- i.e. the external goal gets supported by and channels an intrinsic motivation. Yeah, in the post I say I am unclear as to whether this is stable under reflection. I see alignment techniques that would follow from this as being only really applicable to near-term systems and not to systems undergoing strong RSI. Similarly.

I've been part of or read enough debates with Eliezer to have some guesses about how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect the actual cruxes lie.

I think exploring the whole debate-tree or argument map would be quite long, so I'll just try to gesture at how some of these things are connected, in my map.  

- pivotal acts vs. pivotal processes
-- my take is people's stance on feasibility of pivotal acts vs. processes partially depends on continuity assumptions - what do you believe about pivotal a... (read more)

Sorry, but my rough impression from the post is that you seem to be at least as confused about where the difficulties are as the average alignment researcher you think is not on the ball - and the style of somewhat strawmanning everyone & strong words is a bit irritating.

Maybe I'm getting it wrong, but it seems the model you have for why everyone is not on the ball is something like "people are approaching it too much from a theory perspective, and promising approach is very close to how empirical ML capabilities research works" & "this is a type of pro... (read more)

I am interested in examples of non-empirical (theoretically based) deep learning progress.
6Jakub Kraus2mo
Can you elaborate on why you think this is false? I'm curious.
6Jakub Kraus2mo
On a related note, this part might be misleading: I think earlier forms of this research focused on developing new, alignable algorithms, rather than aligning existing deep learning algorithms. However, a reader of the first quote might think "wow, those people actually thought galaxy-brained decision theory stuff was going to work on deep learning systems!" For more details, see Paul Christiano's 2019 talk on "Current work in AI alignment" []:

To be clear we are explicitly claiming it's likely not the only pressure - check footnotes 9 and 10 for refs.

On the topic of thinking about it for yourself and posting further examples as comments...

This is GPT4 thinking about convergent properties, using the post as a prompt and generating 20 plausibly relevant convergences. 

  • Modularity: Biological systems, like the human brain, display modularity in their structure, allowing for functional specialization and adaptability. Modularity is also found in industries and companies, where teams and departments are organized to handle specific tasks.
  • Hierarchical organization: In biological systems, hierarchical organiz
... (read more)
  1. The concrete example seems like just an example of scope insensitivity?
  2. I don't trust the analysis of point (2) in the concrete example. It seems plausible that turning off the wifi has a bunch of second-order effects, like making various wifi-connected devices not run various computations that happen when online. Consumption of the router could be a smaller part of the effect. It's also quite possible this is not the case, but it's hard to say without experiments.
  3. I would guess most aspiring rationalists are prone to something like an "inverse salt in pasta water fall
... (read more)
the second-order effects of turning off the WiFi are surely comprised of both positive and negative effects, and i have no idea which valence it nets out to. these days homes contain devices whose interconnectedness is used to more efficiently regulate power use. for example, the classic utility-controlled water heater, which reduces power draw when electricity is more expensive for the utility company (i.e. when peakers would need to come online). water heaters mostly don’t use WiFi but thermostats like Nest, programmable light bulbs, etc do: when you disrupt that connection, in which direction is power use more likely to change? i have my phone programmed so that when i go to bed (put it in the “i’m sleeping” Do Not Disturb mode) it will automatically turn off all the outlets and devices — like my TV, game consoles, garage space heater — which i only use during the day. leaving any one of these on for just one night would cancel weeks of gains from disabling WiFi.
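To put rough numbers on the overnight-devices comparison (all wattages here are my guesses, purely illustrative, not measurements):

```python
# Rough, hypothetical wattages:
router_w = 6          # WiFi router, always on
devices_w = 150       # TV + console + space heater left on overnight
overnight_h = 8

# Energy saved by a full week with WiFi off vs. energy wasted by
# leaving the day-use devices on for a single night:
wifi_week_kwh = router_w * 24 * 7 / 1000        # ~1.0 kWh
one_night_kwh = devices_w * overnight_h / 1000  # ~1.2 kWh

print(wifi_week_kwh, one_night_kwh)
```

Under these assumed wattages, one forgotten night already outweighs a week of router savings, which is the shape of the trade-off the comment describes.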
I'm actually more interested in reverse salted pasta water fallacies! It's an extension of Chesterton's Fence, but with an added twist where the fence has a sign with an incorrect explanation of what the fence is for. In the language of Bayes, we might think of it thus:
E = The purpose of the fence stated on the sign
H = The actual purpose of the fence is the stated one
The fallacy would simply be ascribing too high a prior value of P(H), at least within a certain reference class. Potentially, that might be caused by rationalists tailoring their environment such that a high P(H) is a reasonable default assumption almost all the time in their lived experience. Think a bunch of SWE autistics who hang out with each other and form a culture centered on promoting literal precise statements. For them, P(H) is very high for the signs they'll typically encounter. The failure mode is not noticing when you're outside the sampled distribution of life circumstances - when you're thinking about the world outside rationalist culture. It may not have a lower truth content, but it may have lower literalism. But there's not always a clear division between them, and anyway, insisting on literalism and interrogating the false premises of mainstream culture is a way of expanding the boundaries of rationalist culture.

Translating it to my ontology:

1. Training against explicit deceptiveness trains some "boundary-like" barriers which will make simple deceptive thoughts, labelled as such during training, difficult
2. Realistically, advanced AI will need to run some general search processes. The barriers described at step 1. are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given some set of a goal and strong enough cap... (read more)

I don't think in this case the crux/argument goes directly through "the powerful alignment techniques" type of reasoning you describe in the "hopes for alignment".

The crux for your argument is that the AIs - somehow -
a. want to,
b. are willing to, and
c. are able to coordinate with each other.

Even assuming AIs "wanted to", for your case to be realistic they would need to be willing to, and able to, coordinate.

Given that, my question is: how is it possible that AIs are able to trust each other and coordinate with each other?

My... (read more)

I would expect the "expected collapse to waluigi attractor" either not to be real or to mostly go away with training on more data from conversations with "helpful AI assistants".

How this works: currently, the training set does not contain many "conversations with helpful AI assistants". "ChatGPT" is likely mostly not the protagonist in the stories it is trained on. As a consequence, GPT is hallucinating "how conversations with helpful AI assistants may look" and ... this is not a strong localization.

If you train on data where "the ChatGPT... (read more)

Seems a bit like a too-general counterargument against more abstracted views?

1. Hamiltonian mechanics is almost an unfalsifiable tautology
2. Hamiltonian mechanics is applicable to both atoms and stars. So it's probably a bad starting point for understanding atoms
3. It's easier to think of a system of particles in 3d space as a system of particles in 3d space, and not as a Hamiltonian mechanics system in an unintuitive space
4. Likewise, it's easier to think of systems involving electricity using a simple scalar potential and not bring in the Hamiltonian
5. It... (read more)

I agree that calling a representation an 'unfalsifiable tautology' is a type error. Representation problems are ontology, not epistemology. The relevant questions are whether they allow some computation that was intractable before.
3Steven Byrnes3mo
In the OP, I wrote: I think Hamiltonian mechanics passes that test. If my friend says that Hamiltonian mechanics is stupid and they don’t want to learn it or think about it ever, and then my friend spends some time trying to answer practical questions about practical physics systems, they will “trip over” pretty much every aspect of Hamiltonian mechanics in the course of answering those questions. (Examples include “doing almost anything in quantum mechanics” and “figuring out what quantities are conserved in a physical system”.) The real crux is I think where I wrote: “I have yet to see any concrete algorithmic claim about the brain that was not more easily and intuitively [from my perspective] discussed without mentioning FEP.” Have you? If so, what? If somebody says “As a consequence of Hamiltonian mechanics, stars burn hydrogen”, we would correctly recognize that claim as nonsense. Hamiltonian mechanics applies to everything, whereas stars burning hydrogen is a specific contingent hypothesis that might or might not be true. That’s how I feel when I read sentences like “In order to minimize free energy, the agent learns an internal model of potential states of the environment.” [] Maybe the agent does, or maybe it doesn’t! The free energy principle applies to all living things by definition, whereas “learning an internal model of potential states of the environment” is a specific hypothesis about an organism, a hypothesis that might be wrong. For example (as I mentioned in a different comment), imagine a very simple organism that evolved in an extremely, or even perfectly, homogeneous environment. This organism won’t evolve any machinery for learning (or storing or querying) an internal model of potential states of the environment, right? Sure, that’s not super realistic, but it’s certainly possible in principle. And if it happened, the FEP would apply to that organism just like every other organism, right? So the sentence abov

A highly compressed version of what the disagreements are about in my ontology of disagreements about AI safety...

  • crux about continuity; here GA mostly has the intuition "things will be discontinuous" and this manifests in many guesses (phase shifts, new ways of representing data, possibility to demonstrate overpowering the overseer, ...); Paul assumes things will be mostly continuous, with a few exceptions which may be dangerous
    • this seems similar to typical cruxes between Paul and e.g. Eliezer (also in my view this is actually a decent chunk of the disagreement
... (read more)

I'm not really convinced by the linked post 
- the chart is from someone selling financial advice, and the illustrated Elo ratings of chess programs differ from e.g. Wikipedia ("Stockfish estimated Elo rating is over 3500") (maybe it's just old?)
- the linked interview in the "yes" answer is from 2016
- Elo ratings are relative to the other players; it is not trivial to directly compare cyborgs and AI: engine ratings are usually computed in tournaments where programs run with the same hardware limits
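For reference, the Elo model's expected-score formula, which only has meaning within the pool the ratings were computed in (the specific numbers below are illustrative):

```python
def expected_score(elo_a, elo_b):
    """Expected score of player A vs player B under the Elo model."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# A 200-point gap implies ~76% expected score -- but an engine's 3500
# from engine-vs-engine tournaments at fixed hardware is not directly
# comparable to a human (or cyborg) rating from a different pool.
print(round(expected_score(3500, 3300), 2))  # 0.76
```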

In summary,  in my view in something like "correspondence c... (read more)

1Jakub Kraus3mo
That makes sense. My main question is: where is the clear evidence of human negligibility in chess? People seem to be misleadingly confident about this proposition (in general; I'm not targeting your post). When a friend showed me the linked post, I thought "oh wow that really exposes some flaws in my thinking surrounding humans in chess." I believe some of these flaws came from hearing assertive statements from other people on this topic. As an example, here's Sam Harris during his interview with Eliezer Yudkowsky (transcript [], audio []): (In retrospect, this is a very weird assertion. Fifteen days? I thought he was talking about Go, but the last sentence makes it sound like he's talking about chess.)

Yes, the non-stacking issue in the alignment community is mostly due to the nature of the domain

But also partly due to the LessWrong/AF culture and some rationalist memes. For example, if people had stacked on Friston et al., the understanding of agency and predictive systems (now called "simulators") in the alignment community could have advanced several years faster. However, people seem to prefer reinventing stuff, and formalizing their own methods. It's more fun... but also more karma.

In conventional academia, researchers are typically forced to stack... (read more)

2[comment deleted]3mo

Thanks for the comment. I hadn't noticed your preprint before your comment, but it's probably worth noting that I described the point of this post in a Facebook post on 8th Dec 2022; this LW/AF post is just a bit more polished and referenceable. As your paper had zero influence on writing this, and the content predates your paper by a month, I don't see a clear case for citing your work.

Mostly agree - my gears-level model is that the conversations listed tend to hit Limits to Legibility constraints, and marginal returns drop to very low.

For people interested in something like "Double Crux" on what's called here "entrenched views", in my experience what has some chance of working is getting as much understanding as possible into one mind, and then attempting to match the ontologies and intuitions. (I had some success with this on "Drexlerian" vs "MIRIesque" views.)

The analogy I had in mind is not so much in the exact nature of the problem, but in the aspect that it's hard to make explicit precise models of such situations in advance. In the case of nukes, consider the fact that the smartest minds of the time, like von Neumann or Feynman, spent a decent amount of time thinking about the problems, had clever explicit models, and were wrong - in von Neumann's case to the extent that if the US had followed his advice, they would have launched nuclear armageddon.

One big difference is that GoF currently does not seem that dangerous to governments. If you look at it from a perspective focusing not on the layer of individual humans as agents, but instead on states, corporations, memplexes and similar creatures as the agents, GoF maybe does not look that scary? Sure, there was covid, but while it was clearly really bad for humans, it mostly made governments/states relatively stronger. 

Taking this difference into account, my model was, and still is, that governments will react to AI. 

This does not imply reacting in a helpfu... (read more)

  • The GoF analogy is quite weak.
  • "What exactly" seems a bit of a weird type of question. For example, consider nukes: it was hard to predict what exactly the model was by which governments would not blow everyone up after the use of nukes in Japan. But also: while the resulting equilibrium is not great, we haven't died in a nuclear WWIII so far. 
Personally I haven't thought about how strong the analogy to GoF is, but another thing that feels worth noting is that there may be a bunch of other cases where the analogy is similarly strong and where major government efforts aimed at risk-reduction have occurred. And my rough sense is that that's indeed the case, e.g. some of the examples here. In general, at least for important questions worth spending time on, it seems very weird to say "You think X will happen, but we should be very confident it won't because in analogous case Y it didn't", without also either (a) checking for other analogous cases or other lines of argument or (b) providing an argument for why this one case is far more relevant evidence than any other available evidence. I do think it totally makes sense to flag the analogous case and to update in light of it, but stopping there and walking away feeling confident in the answer seems very weird. I haven't read any of the relevant threads in detail, so perhaps the arguments made are stronger than I imply here, but my guess is they weren't. And it seems to me that it's unfortunately decently common for AI risk discussions on LessWrong to involve this pattern I'm sketching here. (To be clear, all I'm arguing here is that these arguments often seem weak, not that their conclusions are false.) (This comment is raising an additional point to Jan's, not disagreeing.) Update: Oh, I just saw Steve Byrnes also wrote the following in this thread, which I totally agree with:

The GoF analogy is quite weak.

As in my comment here, if you have a model that simultaneously both explains the fact that governments are funding GoF research right now, and predicts that governments would nevertheless react helpfully to AGI, I’m very interested to hear it. It seems to me that defunding GoF is a dramatically easier problem in practically every way.

The only responses I can think of right now are (1) “Basically nobody in or near government is working hard to defund GoF but people in or near government will be working hard to spur on a helpful... (read more)

This would be useful if the main problem was misuse, and while this problem is arguably serious, there is another problem, called the alignment problem, that doesn't care who uses AGI, only that it exists. Biotech is probably the best example of technology being slowed down in the manner required, and suffice it to say it only happened because eugenics and anything related to it became taboo after WW2. I obviously don't want a WW3 to slow down AI progress, but the main criticism remains: the examples of tech that were slowed down in the manner required for alignment required massive death tolls, a la a pivotal act.

Empirically, evolution did something highly similar.

While I have a lot of sympathy for the view expressed here, it seems confused in a similar way to straw consequentialism, just in an opposite direction.

Using the terminology from Limits to Legibility, we can roughly split the ways we do morality into two types of thinking
- implicit / S1 / neural-net type / intuitive
- explicit / S2 / legible

What I agree with:

In my view, the explicit S2-type processing basically does not have the representational capacity to hold "human values", and the non-legible S1 neural-net boxes are necessary for being moral.

Attempts ... (read more)

Agreed. As I say in the post: I also mention that faking it til you make it (which relies on explicit S2 type processing) is also justified sometimes, but something one ideally dispenses with. Of course. But I want to highlight something you might have missed: part of the lesson of the "one thought too many" story is that sometimes explicit S2 type processing is intrinsically the wrong sort of processing for that situation: all else being equal you would be a better person if you relied on S1 in that situation. Using S2 in that situation counted against your moral standing. Now of course, if your S1 processing is so flawed that it would have resulted in you taking a drastically worse action, then relying on S2 was overall the better thing for you to do in that moment. But, zooming out, the corollary claim here (to frame things another way) is that even if your S2 process was developed to arbitrarily high levels of accuracy in identifying and taking the right action, there would still be value left on the table because you didn't develop your S1 process. There are a few ways to cash out this idea, but the most common is to say this: developing one's character (one's disposition to feel and react a certain way when confronted with a given situation – your S1 process) in a certain way (gaining the virtues) is constitutive of human flourishing – a life without such character development is lacking. Developing one's moral reasoning (your S2 process) is also important (maybe even necessary), but not sufficient for human flourishing. Regarding explanatory fundamentality: I don't think your analogy is very good. When you describe mechanical phenomena using the different frameworks you mention, there is no disagreement between them about the facts. Different moral theories disagree. They posit different assumptions and get different results. There is certainly much confusion about the moral facts, but saying theorists are confused about whether they disagree with each ot

What's described as an ICF technique is just that, one technique among many.

ICF does not make the IFS assumption that there is some "neutral self". It makes a prediction that when you unblend a few parts from the whole, there is still a lot of power in "the whole". It also makes the claim that in typical internal conflicts and tensions, there are just a few parts which are really activated (and not, e.g., 20). Both seem experimentally verifiable (at least in a phenomenological sense) - and true.

In my view there is a subtle difference between the "self... (read more)

My technical explanation for why not direct consequentialism is somewhat different - deontology and virtue ethics are effective theories. You are not an almost unbounded superintelligence => you can't rely on direct consequentialism.

Why does virtue ethics work? You are mostly a predictive processing system. A guess at a simple PP story:
PP minimizes prediction error. If you take some unvirtuous action, e.g. stealing a little, you are basically prompting the PP engine to minimize the total error between the action taken, your self-model / wants model, and you... (read more)
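As a toy numeric illustration of this guess (entirely my own sketch, not a claim about actual neural machinery): treat the "self-model" as a scalar that descends on prediction error against the actions actually taken. Repeated small unvirtuous actions then drag the self-model toward them.

```python
# Minimal prediction-error-minimization toy: the self-model moves a fraction
# of the error toward each observed action (an illustrative assumption only).
def update_self_model(self_model: float, action: float, lr: float = 0.1) -> float:
    prediction_error = action - self_model
    return self_model + lr * prediction_error

honesty_self_model = 1.0   # 1.0 = "I'm the kind of person who doesn't steal"
for _ in range(20):
    # each small theft is an observed action at 0.0 on the honesty axis
    honesty_self_model = update_self_model(honesty_self_model, action=0.0)

print(round(honesty_self_model, 3))  # the self-model has drifted most of the way down
```

The point of the sketch is just the direction of the dynamic: the cheapest way for an error-minimizing system to reconcile "my actions" with "my self-model" is often to quietly move the self-model.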

Nice, thanks for the pointer.

My overall guess, after surfing through / skimming the philosophy literature on this for many hours, is that you can probably find all the core ideas of this post somewhere in it, but it's pretty frustrating - scattered in many places and diluted by things which are more confused.

  • Virtue ethics is the view that our actions should be motivated by the virtues and habits of character that promote the good life

This sentence doesn't make sense to me. Do you mean something like "Virtue ethics is the view that our actions should be motivated by the virtues and habits of character they promote" or "Virtue ethics is the view that our actions should reinforce virtues and habits of character that promote the good life"? It looks like two sentences got mixed up.

Sorry for the confusion; I tried to paraphrase what classical virtue ethicists believe, in ... (read more)

In my view, you are possibly conflating 
- ICF as a framework
- the described "basic ICF technique"

To me, ICF as a framework seems distinct from IFS in how it is built. As you say, introductory IFS materials take the stories about exiles and protectors as pretty real, and also often use parallels with family therapy. On the more theoretical side, my take on parts of your sequence on parts is that you basically try to fit some theoretical models (e.g. RL, global workspace) to the "standard IFS prior" about types of parts. 

ICF is built the opposite way: ... (read more)

What bothers me about this framing is that experienced practitioners for many different skills end up unlearning some of the advice that's useful for beginners. It's just the nature of much knowledge that you need to explain it in a simplified kind-of-false form first, before you have hope of conveying the complete form. But we don't say that we should have a separate label for the core of physics or chemistry that works; we just call it all "physics" and "chemistry". Or for a different domain where the understanding is built up more from the learner's own experience and pattern-recognition ability than learning increasingly sophisticated scientific theories, here's Josh Waitzkin on how chessmasters end up unlearning previous things they knew about the value of individual chess pieces:
5Matt Goldenberg6mo
I think there is a name for the core of the approaches which works, which is "parts work." The ICF framework seems to add some things on top of the basic parts work idea that make it similar to IFS. For instance, the process of unblending at the beginning is basically the same as what IFS calls "getting into self". In contrast, there are many effective parts work frameworks that do the work from a blended state, such as voice dialogue. It imports the assumption from IFS that there is some "neutral self" that can be reached by continually unblending, and that this self can moderate between parts. In addition, IFS and ICF both seem to emphasize "conversation" as a primary modality, whereas other parts work modalities (e.g. Somatic Experiencing) emphasize other modalities when working with parts, such as somatic, metaphorical, or primal. Again, there's an assumption here about what parts are and how they should be worked with, around the primacy of particular ways of thinking and relating which is heavily (if unconsciously) influenced by the prevalence of IFS and its way of working. It seems like while ICF is trying to describe a general framework, it is quite influenced by the assumptions of IFS/IDC and imports some of their quirks, even while getting rid of others.
Ah yeah probably, I only know ICF from the description in this post. So when I said ICF, I basically meant "the technique described here". I see, that's a definite difference then. I read the article's claim that ICF tries to "generalise from a class of therapy schools and introspection techniques working with parts of the mind" as meaning that the model takes existing therapy schools as its starting point to derive its assumptions from them and their empirical observations, as opposed to deriving things from layered agency in a more first-principles manner. I guess that makes sense - but then is ICF sufficiently general for that either? E.g. if we're talking about IFS ideas that one might want to unlearn, I think that sometimes it's useful to abandon the assumption of the parts being discrete units, and at least this article made it sound like ICF would still assume that. But maybe I'd need to know more about the general framework to know what assumptions it makes, in order to have this discussion.

I overall agree with this comment, but do want to push back on this sentence. I don't really know what it means to "invent AI governance" or "invent AI strategy", so I don't really know what it means to "reinvent AI governance" or "reinvent AI strategy".

By reinventing it, I mean, for example, asking questions like "how to influence the dynamic between AI labs in a way which allows everyone to slow down at a critical stage", "can we convince some actors about AI risk without the main effect being that they will put more resources into the race", "what's up ... (read more)

Here is a sceptical take: anyone who is prone to getting convinced by this post to switch from attempts to do technical AI safety to attempts at “buying time” interventions is pretty likely not a good fit to try any high-powered buying-time interventions. 

The whole thing reads a bit like "AI governance" and "AI strategy" reinvented under a different name, seemingly without bothering to understand what's the current understanding.

Figuring out that AI strategy and governance are maybe important, in late 2022, after spending substantial time on AI safety... (read more)

I agree that "buying time" isn't a very useful category. Some thoughts on the things which seem to fall under the "buying time" category:
  • Evaluations
    • I think people should mostly consider this as a subcategory of technical alignment work, in particular the work of understanding models. The best evaluations will include work that's pretty continuous with ML research more generally, like fine-tuning on novel tasks, developing new prompting techniques, and application of interpretability techniques.
  • Governance work, some subcategories of which include:
    • Lab coordination: work on this should mainly be done in close consultation with people already working at big AI labs, in order to understand the relevant constraints and opportunities
    • Policy work: see standard resources on this
    • Various strands of technical work which is useful for the above
  • Outreach
    • One way to contribute to outreach is doing logistics for outreach programs (like the AGI safety fundamentals course)
    • Another way is to directly engage with ML researchers
    • Both of these seem very different from "buying time" - or at least "outreach to persuade people to become alignment researchers" doesn't seem very different from "outreach to buy time somehow"

The whole thing reads a bit like "AI governance" and "AI strategy" reinvented under a different name, seemingly without bothering to understand what's the current understanding.

I overall agree with this comment, but do want to push back on this sentence. I don't really know what it means to "invent AI governance" or "invent AI strategy", so I don't really know what it means to "reinvent AI governance" or "reinvent AI strategy".

Separately, I also don't really think it's worth spending a ton of time trying to really understand what current people think ab... (read more)

Sorry for being snarky, but I think at least some LW readers should gradually notice to what extent the stuff analyzed here mirrors the predictive processing paradigm, as a different way to make stuff which acts in the world. My guess is the big step on the road in this direction is not e.g. 'complex wrappers with simulated agents', but reinventing active inference... and I also suspect it's the only step separating us from AGI, which seems like a good reason not to point too much attention that way. 

It is not clear to me to what extent this was part of the "training shoulder advisors" exercise, but to me, possibly the most important part of it is keeping the advisors at a distance from your own thinking. In particular, in my impression, it seems likely that alignment research has on average been harmed by too many people "training their shoulder Eliezers", with the shoulder advisors pushing them to think in a crude version of Eliezer's ontology. 

I chose the "train a shoulder advisor" framing specifically to keep my/Eliezer's models separate from the participants' own models. And I do think this worked pretty well - I've had multiple conversations with a participant where they say something, I disagree with it, and then they say "yup, that's what my John model said" - implying that they did in fact disagree with their John model. (That's not quite direct evidence of maintaining a separate ontology, but it's adjacent.)

A couple of years ago we developed something in this direction in Epistea that seems generally better and a little less confused, called "Internal Communication Framework".

I won't describe the whole technique here, but the generative idea mostly is "drop priors used in IFS or IDC, and instead lean into a metaphor of facilitating a conversation between parts from a position of kindness and open curiosity". (The second part is getting more theoretical clarity on the whole-parts relation.)

From the perspective of ICF, what seems suboptimal with the IDC algorithm de... (read more)

For what it's worth, my experience is that while the written materials for IFS make a somewhat big deal out of the exile/firefighter/manager thing, people trained in IFS don't give it that much attention when actually doing it. In practice, the categories aren't very rigid and a part can take on properties of several different categories. I'd assume that most experienced facilitators would recognize this and just focus on understanding each part "as an individual", without worrying about which category it might happen to fit in. I guess this is what you already say in the last paragraph, but I'd characterize it more as "the formal protocols are the simplified training wheels version of the real thing" than as "skilled facilitators stop doing the real thing described in the protocols". (Also not all IFS protocols even make all those distinctions, e.g. "the 6 Fs" only talks about a protector that has some fear.)
"Instead lean into a metaphor of facilitating a conversation between parts from a position of kindness and open curiosity" is ... straightforwardly what the essay above recommends?

The upside of this, or of "more is different", is that we don't necessarily even need the property in the parts, or a detailed understanding of the parts. And how the composition works / what survives renormalization / ... is almost the whole problem.


This seems to be almost exclusively based on the proxies of humans and human institutions. Reasons why this does not necessarily generalize to advanced AIs are often visible when looking from the perspective of other proxies, e.g. programs or insects.


So far, progress of ML often led to this pattern:

1. ML models sort of suck, maybe help a bit sometimes. Humans are clearly better ("humans better").
2. ML models get overall comparable to humans, but have different strengths and weaknesses; human+AI teams beat both best AIs alone, or best humans a... (read more)

With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demski, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.

While I generally like the post, I somewhat disagree with this summary of state of understanding, which seems to ignore quite a lot of academic research. In particular

- Friston et al certainly understand this (cf ... do... (read more)

Jan, I agree with your references, especially Friston et al. I think those kinds of understanding, as you say, have not adequately made their way into utility-theoretic fields like econ and game theory, so I think the post is valid as a statement about the state of understanding in those utility-oriented fields. (Note that the post is about "a missing concept from the axioms of game theory and bargaining theory" and "a key missing concept from utility theory", and not "concepts missing from the mind of all of humanity".)

I would correct "Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did not have to start from scratch, and already had a reasonably complex 'API' for interoceptive variables."

from the summary to something like this

"Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did not have to start locating 'goals' and relevant world-features in the learned world models. Instead, it re-used the existing goal-specifying circuits, and implicit-world-models, existing in older organisms. Most of the goal... (read more)

Jan - well said, and I strongly agree with your perspective here. Any theory of human values should also be consistent with the deep evolutionary history of the adaptive origins and functions of values in general - from the earliest Cambrian animals with complex nervous systems through vertebrates, social primates, and prehistoric hominids. As William James pointed out in 1890 (paraphrasing here), human intelligence depends on humans having more evolved instincts, preferences, and values than other animals, not having fewer.