All of Koen.Holtman's Comments + Replies

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

this is something you would use on top of a model trained and monitored by engineers with domain knowledge.

OK, that is a good way to frame it.

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

I guess I should make another general remark here.

Yes, using implicit knowledge in your solution would be considered cheating, and bad form, when passing AI system benchmarks which intend to test more generic capabilities.

However, if I were to buy an alignment solution from a startup, then I would prefer to be told that the solution encodes a lot of relevant implicit knowledge about the problem domain. Incorporating such knowledge would no longer be cheating, it would be an expected part of safety engineering.

This seeming contradiction is of course one of these things that makes AI safety engineering so interesting as a field.

Hi Koen, We agree that companies should employ engineers with product domain knowledge. I know this looks like a training set in the way it's presented - especially since that's what ML researchers are used to seeing - but we actually intended it as a toy model for automated detection and correction of unexpected 'model splintering' during monitoring of models in deployment. In other words, this is something you would use on top of a model trained and monitored by engineers with domain knowledge, to assist them in their work when features splinter.
Benchmark for successful concept extrapolation/avoiding goal misgeneralization

Interesting. Some high-level thoughts:

When reading your definition of concept extrapolation as it appears here:

Concept extrapolation is the skill of taking a concept, a feature, or a goal that is defined in a narrow training situation... and extrapolating it safely to a more general situation.

this reads to me like the problem of Robustness to Distributional Change from Concrete Problems. This problem is also often known as out-of-distribution robustness, but note that Concrete Problems also considers solutions like the AI detecting that it is out-... (read more)

Principles for Alignment/Agency Projects

Not sure what makes you think 'strawmen' at 2, but I can try to unpack this more for you.

Many warnings about unaligned AI start with the observation that it is a very bad idea to put some naively constructed reward function, like 'maximize paper clip production', into a sufficiently powerful AI. Nowadays on this forum, this is often called the 'outer alignment' problem. If you are truly worried about this problem and its impact on human survival, then it follows that you should be interested in doing the Hard Thing of helping people all over the world w... (read more)

Principles for Alignment/Agency Projects

I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.

That being said, some of the Hamming problems I see that are being avoided most on this forum, and in the AI alignment community, are

  1. Do something that will affect policy in a positive way

  2. Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions

3 Viktor Rehnberg 1mo
I agree with 1 (but then it is called alignment forum, not the more general AI Safety forum). But I don't see that 2 would do much good. All narratives I can think of where 2 plays a significant part sound like strawmen to me; perhaps you could help me?
Looking back on my alignment PhD

I have said nice things about AUP in the past (in past papers I wrote) and I will continue to say them. I can definitely see real-life cases where adding an AUP term to a reward function makes the resulting AI or AGI more aligned. Therefore, I see AUP as a useful and welcome tool in the AI alignment/safety toolbox. Sure, this tool alone does not solve every problem, but that hardly makes it a pointless tool.

From your off-the-cuff remarks, I am guessing that you are currently inhabiting the strange place where 'pivotal acts' are your preferred alignment solution. I will grant that, if you are in that place, then AUP might appear more pointless to you than it does to me.
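To make "adding an AUP term to a reward function" concrete, here is a minimal sketch, assuming the commonly described form of the attainable utility preservation penalty: the primary reward minus a weighted average of how much an action changes the agent's attainable value for a set of auxiliary goals, compared to doing nothing. The names `aup_reward`, `auxiliary_q`, `noop_action` and `penalty_weight` are hypothetical, and the plain averaging used for normalisation is an assumption; published variants scale the penalty differently.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable
# An auxiliary Q-function estimates attainable value for one auxiliary goal.
QFunction = Callable[[State, Action], float]


def aup_reward(
    primary_reward: float,
    state: State,
    action: Action,
    noop_action: Action,
    auxiliary_q: Sequence[QFunction],
    penalty_weight: float = 1.0,
) -> float:
    """Return the primary reward minus an AUP-style penalty (illustrative only)."""
    if not auxiliary_q:
        # With no auxiliary goals there is nothing to penalise.
        return primary_reward
    # Average absolute change in attainable auxiliary value relative to the no-op.
    penalty = sum(
        abs(q(state, action) - q(state, noop_action)) for q in auxiliary_q
    ) / len(auxiliary_q)
    return primary_reward - penalty_weight * penalty


# Toy auxiliary Q-functions (assumed values, for illustration only).
def q_vase_intact(state: State, action: Action) -> float:
    return {"noop": 1.0, "smash": 0.0}[action]


def q_door_open(state: State, action: Action) -> float:
    return {"noop": 2.0, "smash": 2.0}[action]
```

With these toy values, an action that destroys attainable value for one auxiliary goal gets its reward reduced, while the no-op is never penalised. Whether the resulting agent is actually more aligned depends on the choice of auxiliary goals and weight, which this sketch does not address.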

AGI Ruin: A List of Lethalities

IMO the biggest hole here is "why should a superhuman AI be extremely consequentialist/optimizing"?

I agree this is a very big hole. My opinion here is not humble. My considered opinion is that Eliezer is deeply wrong in point 23, on many levels. (Edited to add: I guess I should include an informative link instead of just expressing my disappointment. Here is my 2021 review of the state of the corrigibility field).

Steven, in response to your line of reasoning to fix/clarify this point 23: I am not arguing for pivotal acts as considered and then r... (read more)

AGI Ruin: A List of Lethalities

You are welcome. I carefully avoided mentioning my credentials as a rhetorical device.

I rank the credibility of my own informed guesses far above those of Eliezer.

This is to highlight the essence of how many of the arguments on this site work.

AGI Ruin: A List of Lethalities

Why do you rate yourself "far above" someone who has spent decades working in this field?

Well put, valid question. By the way, did you notice how careful I was in avoiding any direct mention of my own credentials above?

I see that Rob has already written a reply to your comments, making some of the broader points that I could have made too. So I'll cover some other things.

To answer your valid question: If you hover over my LW/AF username, you can see that I self-code as the kind of alignment researcher who is also a card-carrying member of the academic... (read more)

Thanks for taking my question seriously - I am still a bit confused why you would have been so careful to avoid mentioning your credentials up front, though, given that they're fairly relevant to whether I should take your opinion seriously. Also, neat, I had not realized hovering over a username gave so much information!
AGI Ruin: A List of Lethalities

Having read the original post and many of the comments made so far, I'll add an epistemological observation that I have not seen others make yet quite so forcefully. From the original post:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]

I want to highlight that many of the different 'true things' on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future... (read more)

Apologies if there is a clear answer to this, since I don't know your name and you might well be super-famous in the field: Why do you rate yourself "far above" someone who has spent decades working in this field? Appealing to experts like MIRI makes for a strong argument. Appealing to your own guesses instead seems like the sort of thought process that leads to anti-vaxxers.
AGI Ruin: A List of Lethalities

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.

I can honestly say however that the project of writing t... (read more)

Announcing the Alignment of Complex Systems Research Group

If you’re interested in conceptual work on agency and the intersection of complex systems and AI alignment

I'm interested in this agenda, and I have been working on this kind of thing myself, but I am not interested at this time in moving to Prague. I figure that you are looking for people interested in moving to Prague, but if you are issuing a broad call for collaborators in general, or are thinking about setting up a much more distributed group, please clarify.

A more technical question about your approach:

What we’re looking for is more like a ver

... (read more)
Reshaping the AI Industry

There are some good thoughts here. I like this enough that I am going to comment on the effective strategies angle. You state that

The wider AI research community is an almost-optimal engine of apocalypse.


AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.

I have to observe that, even though certain people on this forum definitely do believe the above two statements, even on this forum this extreme level of pessimism is a minority opinion. Personally, I have been quite pleased with the pace... (read more)

Would (myopic) general public good producers significantly accelerate the development of AGI?

What are some of those [under-produced software] components? We can put them on a list.

Good question. I don't have a list, just a general sense of the situation. Making a list would be a research project in itself. Also, different people here would give you different answers. That being said,

  • I occasionally see comments from alignment research orgs who do actual software experiments that they spend a lot of time on just building and maintaining the infrastructure to run large scale experiments. You'd have to talk to actual orgs to ask them w

... (read more)
Would (myopic) general public good producers significantly accelerate the development of AGI?

I guess I got that impression from the 'public good producers significantly accelerate the development of AGI' in the title, and then looking at the impactcerts website.

I somehow overlooked the bit where you state that you are also wondering if that would be a good idea.

To be clear: my sense of the current AI open source space is that it definitely under-produces certain software components, software components that could be relevant for improving AI/AGI safety.

What are some of those components? We can put them on a list. By the way, "myopic" means "pathologically short-term".
Would (myopic) general public good producers significantly accelerate the development of AGI?

If I am reading you correctly, you are trying to build an incentive structure that will accelerate the development of AGI. Many alignment researchers (I am one) will tell you that this is not a good idea; instead, you want to build an incentive structure that will accelerate the development of safety systems and alignment methods for AI and AGI.

There is a lot of open source production in the AI world, but you are right in speculating that a lot of AI code and know-how is never open sourced. Take a look at the self-driving car R&D landscape if you want... (read more)

No, I'm not sure how you got that impression (was it "failing to coordinate"?), I'm asking for the opposite reason.
Paradigm-building from first principles: Effective altruism, AGI, and alignment

First, a remark on Holden's writeup. I wrote above that several EA funding managers are on record as wanting to fund pre-paradigmatic research. From his writeup, I am not entirely sure if Holden is one of them; the word 'paradigmatic' does not appear in it. But it is definitely clear that Holden is not very happy with the current paradigm of AI research, in the Kuhnian sense where a paradigm is more than just a scientific method but a whole value system supported by a dominant tribe.

To quote a bit of Wikipedia:

Kuhn acknowledges having used the term "p

... (read more)
Why I'm co-founding Aligned AI

To do this, we'll start by offering alignment as a service for more limited AIs.

Interesting move! Will be interesting to see how you will end up packaging and positioning this alignment as a service, compared to the services offered by more general IT consulting companies. Good luck!

Paradigm-building: The hierarchical question framework

A comment on your list of questions after reading the whole sequence: unlike John and Tekhne elsewhere in this comment thread, I am pretty comfortable with the hierarchical list of questions you are developing here.

This is a pretty useful set of questions that could be taken as starting points for all kinds of useful paradigmatic research.

I believe that part of John's lack of comfort with the above list of questions is caused by a certain speculative assumption he makes about AGI alignment, an assumption that is also made by many in MIRI, an assumption pop... (read more)

Paradigm-building: Conclusion and practical takeaways

Just want to say: I read the whole sequence and enjoyed reading it.

Question 3: Control proposals for minimizing bad outcomes

An intriguing and neglected direction for control proposal research concerns endogenous control—i.e., self-control.

Agree. To frame this in paradigm-language: most of the discussion on this forum, both arguments about AI/AGI dangers and plans that consider possible solutions, uses paradigm A:

Paradigm A: We treat the AGI as a spherical econ with an unknown and opaque internal structure, which was set up to maximise a reward function/reward signal.

But there is also

Paradigm B: We treat the AGI as a computer program with an internal motivation and structu... (read more)

Question 4: Implementing the control proposals

I like you writing about this: the policy problem is not mentioned often enough on this forum. Agree that it needs to be part of AGI safety research.

I have no deep insights to add, just a few high level remarks:

to pass laws internationally that make it illegal to operate or supervise an AGI that is not properly equipped with the relevant control mechanisms. I think this proposal is necessary but insufficient. The biggest problem with it is that it is totally unenforceable.

I feel that the 'totally unenforceable' meme is very dangerous - it is too ofte... (read more)

1Cameron Berg6mo
Thanks for your comment! I agree with both of your hesitations and I think I will make the relevant changes to the post: instead of 'totally unenforceable,' I'll say 'seems quite challenging to enforce.' I believe that this is true (and I hope that the broad takeaway from this post is basically the opposite of 'researchers need to stay out of the policy game,' so I'm not too concerned that I'd be incentivizing the wrong behavior). To your point, 'logistically and politically inconceivable' is probably similarly overblown. I will change it to 'highly logistically and politically fraught.' You're right that the general failure of these policies shouldn't be equated with their inconceivability. (I am fairly confident that, if we were so inclined, we could go download a free copy of any movie or song we could dream of—I wouldn't consider this a case study of policy success—only of policy conceivability!).
Question 1: Predicted architecture of AGI learning algorithm(s)

I'm interested to see where you will take this.

A terminology comment: as part of your classification system, you are calling 'supervised learning' and 'reinforcement learning' two different AI/AGI 'learning algorithm architectures'. This takes some time for me to get used to. It is more common in AI to say that SL and RL solve two different problems, and are different types of AI.

The more common framing would be to say that an RL system is fundamentally an example of an autonomous agent type AI, and an SL system is fundamentally an example of an input cla... (read more)

Thoughts on AGI safety from the top

I like your section 2. As you are asking for feedback on your plans in section 3:

By default I plan to continue looking into the directions in section 3.1, namely transparency of current models and its (potential) intersection with developments in deep learning theory. [...] Since this is what I plan to do, it'd be useful for me to know if it seems totally misguided

I see two ways to improve AI transparency in the face of opaque learned models:

  1. try to make the learned models less opaque -- this is your direction

  2. try to find ways to build more transp

... (read more)
Paradigm-building from first principles: Effective altruism, AGI, and alignment

I like what you are saying above, but I also think there is a deeper story about paradigms and EA that you are not yet touching on.

I am an alignment researcher, but not an EA. I read quite broadly about alignment research, specifically I also read beyond the filter bubble of EA and this forum. What I notice is that many authors, both inside and outside of EA, observe that the field needs more research and more fresh ideas.

However, the claim that the field as a whole is 'pre-paradigmatic' is a framing that I see only on the EA and Rationalist side.

To ma... (read more)

2Arthur Conmy6mo
I am interested in this criticism, particularly in connection to misconception 1 from Holden's 'Important, actionable research questions for the most important century', which to me suggests doing less paradigmatic research (which I interpret to mean 'what 'normal science' looks like in ML research/industry' in the Structure of Scientific Revolutions sense, do say if I misinterpret 'paradigm'). I think this division would benefit from some examples however. To what extent do you agree with a quick classification of mine?

Paradigmatic alignment research

  1. Interpretability of neural nets (e.g. colah's vision and transformer circuits)

  2. Dealing with dataset bias and generalisation in ML

Pre-paradigmatic alignment research

  1. Agentic foundations and things MIRI work on

  2. Proposals for alignment put forward by Paul Christiano, e.g. Iterated Amplification

My concern is that while the list two problems are more fuzzy and less well-defined, they are far less directly if at all (in 2) actually working on the problem we actually care about.
Instrumental Convergence For Realistic Agent Objectives

instrumental convergence basically disappears for agents with utility functions over action-observation histories.

Wait, I am puzzled. Have you just completely changed your mind about the preconditions needed to get a power-seeking agent? The way the above reads is: just add some observation of actions to your realistic utility function, and your instrumental convergence problem is solved.

  1. u-AOH (utility functions over action-observation histories): No IC

  2. u-OH (utility functions over observation histories): Strong IC
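The two classes in the list differ only in what the utility function is allowed to look at. A minimal sketch of that type distinction, assuming a history is a list of (action, observation) pairs; the function names and the toy action and observation labels are hypothetical, and the sketch says nothing about whether instrumental convergence actually arises in either class:

```python
from typing import List, Tuple

# A history is a sequence of (action, observation) pairs.
History = List[Tuple[str, str]]


def u_oh(history: History) -> float:
    """u-OH: utility depends on the observations only."""
    return sum(1.0 for _action, observation in history if observation == "paperclip")


def u_aoh(history: History) -> float:
    """u-AOH: utility may also depend on the actions taken."""
    return sum(1.0 for action, _observation in history if action == "wait")


# Two histories with identical observations but different actions.
history_wait: History = [("wait", "paperclip"), ("wait", "nothing")]
history_grab: History = [("grab", "paperclip"), ("grab", "nothing")]
```

Any u-OH function must assign the same utility to `history_wait` and `history_grab`, while a u-AOH function is free to prefer one of them purely because of the actions it contains.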

There are many utility func... (read more)

I recommend reading the quoted post for clarification.
Is AI Alignment a pseudoscience?

I am not familiar with the specific rationalist theory of AGI developed in the high rationalist era of the early 2010s. I am not a rationalist, but I do like histories of ideas, so I am delighted to learn that such a thing as the high rationalist era of the early 2010s even exists.

If I were to learn more about the actual theory, I suspect that you and I would end up agreeing that the rationalist theory of AGI developed in the high rationalist era was crankish.

Yes. I was trying to avoid the downvote demon by hinting quietly. PS looks like he winged me.
Is AI Alignment a pseudoscience?

It is your opinion that despite the expenditure of a lot of effort, no specific laws of AGI have been found. This opinion is common on this forum; it puts you in what could be called the 'pre-paradigmatic' camp.

My opinion is that the laws of AGI are the general laws of any form of computation (that we can physically implement), with some extreme values filled in. See my original comment. Plenty of useful work has been done based on this paradigm.

Maybe it's common now. During the high rationalist era, early 2010s, there was supposed to be a theory of AGI based on rationality. The problem was that ideal rationality is uncomputable, so that approach would involve going against what is already known about computation, and therefore crankish. (And the claim that any AI is non-ideally rational, whilst defensible for some values of non-ideally rational, is not useful, since there are many ways of being non-ideal).
Is AI Alignment a pseudoscience?

In physics, we can try to reason about black holes and the big bang by inserting extreme values into the equations we know as the laws of physics, laws we got from observing less extreme phenomena. Would this also be 'a fictional-world-building exercise' to you?

Reasoning about AGI is similar to reasoning about black holes: both of these do not necessarily lead to pseudo-science, though both also attract a lot of fringe thinkers, and not all of them think robustly all of the time.

In the AGI case, the extreme value math can be somewhat trivial, if you want... (read more)

Is AI Alignment a pseudoscience?

the Embedded Agency post often mentioned as a good introductory material into AI Alignment.

For the record: I feel that Embedded Agency is a horrible introduction to AI alignment. But my opinion is a minority opinion on this forum.

Is AI Alignment a pseudoscience?

There is a huge diversity in posts on AI alignment on this forum. I'd agree that some of them are pseudo-scientific, but many more posts fall in one of the following categories:

  1. authors follow the scientific method of some discipline, or use multidisciplinary methods,

  2. authors admit outright that they are in a somewhat pre-scientific state, i.e. they do not have a method/paradigm yet that they have any confidence in, or

  3. authors are talking about their gut feelings of what might be true, and again freely admit this

Arguably, posts of type 2 and 3 ab... (read more)

Scalar reward is not enough for aligned AGI

I agree with your general comments, and I'd like to add some additional observations of my own.

Reading the paper Reward is Enough, what strikes me most is that the paper is reductionist almost to the point of being a self-parody.

Take a sentence like:

The reward-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.

I could rewrite this to

The physics-is-enough hypothesis postulates that intelligence, and its associated abilities,

... (read more)
Challenges with Breaking into MIRI-Style Research

Any thoughts on how to encourage a healthier dynamic?

I have no easy solution to offer, except for the obvious comment that the world is bigger than this forum.

My own stance is to treat the over-production of posts of type 1 above as just one of these inevitable things that will happen in the modern media landscape. There is some value to these posts, but after you have read about 20 of them, you can be pretty sure about how the next one will go.

So I try to focus my energy, as a reader and writer, on work of type 2 instead. I treat arXiv as my main pu... (read more)

Challenges with Breaking into MIRI-Style Research

I like your summary of the situation:

Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong.

This has also been my experience, at least on this forum. Much less so in academic-style papers about alignment. This has certain consequences for the problem of breaking into preparadigmatic alignment research.

Here are two ways to do preparadigmatic research:

  1. Find something that is all wrong with somebody else's paradigm, then write about it.

  2. Find a new useful paradigm and write about it.

MIR... (read more)

Hmm... Yeah, I certainly don't think that there's enough collaboration or appreciation of the insights that other approaches may provide. Any thoughts on how to encourage a healthier dynamic?
An Open Philanthropy grant proposal: Causal representation learning of human preferences

In other words, human preferences have a causal structure, can we learn its concepts and their causal relations?


Since I am not aware of anyone trying to use these techniques in AI Safety

I am not fully sure what particular sub-problem you propose to address (causal learning of the human reward function? Something different?), but some references to recent work you may find interesting:

Two recent workshops at NeurIPS 2021:

... (read more)
Hey Koen, Thanks a lot for the pointers! The literature I am most aware of are [], [] and Bernhard Scholkopf's webpage
My Overview of the AI Alignment Landscape: A Bird's Eye View

Thanks, yes that new phrasing is better.

Bit surprised that you can think of no researchers to associate with Corrigibility. MIRI have written concrete work about it and so has Christiano. It is a major theme in Bostrom's Superintelligence, and it also appears under the phrasing 'problem of control' in Russell's Human Compatible.

In terms of the history of ideas of the field, I think that corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader ... (read more)

My Overview of the AI Alignment Landscape: A Bird's Eye View

Thanks for posting this writeup, overall this reads very well, and it should be useful to newcomers. The threat models section is both compact and fairly comprehensive.

I have a comment on the agendas to build safe AGI section however. In the section you write

I focus on three agendas I consider most prominent

When I finished reading the list of three agendas in it, my first thought was 'Why does this not mention other prominent agendas like corrigibility? This list is hardly a bird's-eye overview mentioning all prominent agendas to build safe AI.'

D... (read more)

2Neel Nanda7mo
Thanks for the feedback! That makes sense, I've updated the intro paragraph to that section to: Does that seem better? For what it's worth, my main bar was a combination of 'do I understand this agenda well enough to write a summary' and 'do I associate at least one researcher and some concrete work with this agenda'. I wouldn't think of corrigibility as passing the second bar, since I've only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It's very possible I've missed out on some important work though, and I'd love to hear pushback on this
$1000 USD prize - Circular Dependency of Counterfactuals

Not aware of which part would be a Wittgensteinian quote. It has been a long time since I read Wittgenstein, and I read him in German. In any case, I remain confused about what you mean by 'circular'.

Hmm... Oh, I think that was elsewhere on this thread. Probably not to you. Eliezer's Where Recursive Justification Hits Bottom seems to embrace a circular epistemology despite its title.
$1000 USD prize - Circular Dependency of Counterfactuals

Wait, I was under the impression from the quoted text that you make a distinction between 'circular epistemology' and 'other types of epistemology that will hit a point where we can provide no justification at all'. i.e. these other types are not circular because they are ultimately defined as a set of axioms, rewriting rules, and observational protocols for which no further justification is being attempted.

So I think I am still struggling to see what flavour of philosophical thought you want people to engage with, when you mention 'circular'.

Mind you, I... (read more)

If you're referring to the Wittgensteinian quote, I was merely quoting him, not endorsing his views.
$1000 USD prize - Circular Dependency of Counterfactuals

OK thanks for explaining. See my other recent reply for more thoughts about this.

$1000 USD prize - Circular Dependency of Counterfactuals

It's possible for an article to be here's why these 3 reasons why we might think counterfactuals are circular are all false

OK, so if I understand you correctly, you posit that there is something called 'circular epistemology'. You said in the earlier post you link to at the top:

You might think that the circularity is a problem, but circular epistemology turns out to be viable (see Eliezer's Where Recursive Justification Hits Bottom). And while circular reasoning is less than ideal, if the comparative is eventually hitting a point where we can provide

... (read more)
Yeah, I believe epistemology to be inherently circular. I think it has some relation to counterfactuals being circular, but I don't see it as quite the same, as counterfactuals seem a lot harder to avoid using than most other concepts. The point of mentioning circular epistemology was to persuade people that my theory isn't as absurd as it sounds at first.
$1000 USD prize - Circular Dependency of Counterfactuals

Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think

??? I don't follow. You meant to write "use system X instead of using system Y which calls itself a definition of counterfactuals "?

What I mean is that some people seem to think that if they can describe a system that explains counterfactuals without mentioning counterfactuals when explaining them that they've avoided a circular dependence. When of course, we can't just take things at face value, but have to dig deeper than that.
More Is Different for AI

Interesting. Reading the different paragraphs I am somewhat confused on how you classify thought experiments: part of engineering, part of philosophy, or a third thing by itself?

I'd be curious to see you expand on the following question: if we treat thought experiments as not being a philosophical technique, what other techniques or insights does philosophy have to offer to alignment?

Another comment: you write

When thinking about safety risks from ML, there are two common approaches, which I'll call the Engineering approach and the Philosophy approach.

My rec... (read more)

$1000 USD prize - Circular Dependency of Counterfactuals

Some people have asked why the Bayesian Network approach suggested by Judea Pearl is insufficient (including in the comments below). This approach is firmly rooted in Causal Decision Theory (CDT). Most people on LW have rejected CDT because of its failure to handle Newcomb's Problem.

I'll make a counter-claim and say that most people on LW in fact have rejected the use of Newcomb's Problem as a test that will say something useful about decision theories.

That being said, there is definitely a sub-community which believes deeply in the relevance of Newcomb... (read more)

Firstly, I don't see why that would interfere with evaluating possible arguments for and against circular dependency. It's possible for an article to be here's why these 3 reasons why we might think counterfactuals are circular are all false (not stating that an article would have to necessarily engage with 3 different arguments to win). Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think merely not mentioning counterfactuals means that there isn't a dependence on them. So there likely needs to be some part of such an article discussing why things that look counterfactual really aren't. I briefly skimmed your article and I'm sure if I read it further I'd learn something interesting, but as is, it wouldn't be in scope.
Demanding and Designing Aligned Cognitive Architectures

Not entirely sure what you mean with your aside on 'unsupervised predictive reward'. Is this a reference to unsupervised reinforcement learning? To a human supervisor controlling a live reward signal?

But on your observation that 'the learned model will compensate for distortions': this sounds familiar. Here is a discussion.

Intuition pumps and inner alignment failure

It is common for people on this forum to use a teleological intuition pump which makes them fear that such compensation for distortions must somehow always happen, or is very likely to happen... (read more)

2 · Charlie Steiner · 8mo
This isn't about "inner alignment" (as catchy as the name might be), it's just about regular old alignment. But I think you're right. As long as the learning step "erases" the model editing in a sensible way, then I was wrong and there won't be an incentive for the learned model to compensate for the editing process. So if you can identify a "customer gender detector" node in some complicated learned world-model, you can artificially set it to a middle value as a legitimate way to make the RL agent less sensitive to customer gender. I'm not sure how well this approach fulfills modern regulatory guidelines, or how useful it is for AGI alignment, for basically the same reason: models that are learned from scratch sometimes encode knowledge in a distributed way that's hard to edit out. Ultimately, your proof that a certain edit reduces gender bias is going to have to come from testing the system and checking that it behaves well, which is a kind of evidence a lot of regulators / businesspeople are skittish about.
Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship

I'm especially interested in the analogy between AI alignment and democracy.

This is indeed a productive analogy. Sadly, on this forum, this analogy is used in 99% of the cases to generate AI alignment failure mode stories, whereas I am much more interested in using it to generate useful ideas about AI safety mechanisms.

You may be interested in my recent paper 'demanding and designing', just announced here, where I show how to do the useful idea generating thing. I transfer some insights about aligning powerful governments and companies to the probl... (read more)

Consequentialism & corrigibility

Very open to feedback.

I have not read the whole comment section, so this feedback may already have been given, but...

I believe the “indifference” method represented some progress towards a corrigible utility-function-over-future-states, but not a complete solution (apparently it’s not reflectively consistent—i.e., if the off-switch breaks, it wouldn't fix it), and the problem remains open to this day.

Opinions differ on how open the problem remains. Definitely, going by the recent Yudkowsky sequences, MIRI still acts as if the problem is open, and ... (read more)

Demanding and Designing Aligned Cognitive Architectures


I can think of several reasons why different people on this forum might facepalm when seeing the diagram with the green boxes. Not sure if I can correctly guess yours. Feel free to expand.

But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you're disagreeing with that as well - is that right?

Yes, I am disagreeing, in a way. I would disagree with the statement that

| AI alignment research is a subset of AI research

but I agree with the statement that

| Some parts of AI alignment research a... (read more)

2 · Charlie Steiner · 8mo
The facepalm was just because if this is really all inside the same RL architecture (no extra lines leading from the world-model to an unsupervised predictive reward), then all that will happen is the learned model will compensate for the distortions.
The Plan

I agree in general that pursuing multiple alternative alignment approaches (and using them all together to create higher levels of safety) is valuable. I am more optimistic than you that we can design control systems (different from time horizon based myopia) which will be stable and understandable even at higher levels of AGI competence.

it still seems likely that someone, somewhere, will try fiddling around with another AGI's time horizon parameters and cause a disaster.

Well, if you worry about people fiddling with control system tuning parameters... (read more)

The Plan

I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia).

Interesting observation on the above post! Though I do not read it explicitly in John's Plan, I guess you can indeed implicitly read that John's Plan rejects routes to alignment that focus on control/myopia, routes that do not visit step 2 of successfully solving automatic/ambitious value learning first.

John, can you confirm this?

Background: my own default Plan does focus on control/myopia. I feel that this line of attack for ... (read more)

3 · Jon Garcia · 8mo
It's quite possible that control is easier than ambitious value learning, but I doubt that it's as sustainable. Approaches like myopia, IDA, or HCH would probably get you an AGI that is aligned to much higher levels of intelligence than doing without them, all else being equal. But if there is nothing pulling its motivations explicitly back toward a basin of value alignment, then I feel like these approaches would be prone to diverging from alignment at some level beyond where any human could tell what's going on with the system. I do think that methods of control are worthwhile to pursue over the short term, but we had better be simultaneously working on ambitious value learning in the meantime for when an ASI inevitably escapes our control anyway. Even if myopia, for instance, worked perfectly to constrain what some AGI is able to conspire, it still seems likely that someone, somewhere, will try fiddling around with another AGI's time horizon parameters and cause a disaster. It would be better if AGI models, from the beginning, had at least some value learning system built in by default to act as an extra safeguard.