Thoughts in Philosophy of Science of AI Alignment


Here is another interpretation of what can cause a lack of robustness to scaling down: 

(Maybe this is what you have in mind when you talk about single-single alignment not (necessarily) scaling to multi-multi alignment - but I am not sure that is the case, and even if it is, I feel pulled to state it again, as I don't think it comes out as clearly as I would want it to in the original post.)

Taking the example of an "alignment strategy [that makes] the AI find the preferences and values of humans, and then pursu[e] that", robustness to scaling down can break if "human values" (as invoked in the example) don't "survive" reductionism; i.e. if, when we try to apply reductionism to "human values", we are left with "less" than what we hoped for.

This is the inverse of saying that there is an important non-linearity when trying to scale (up) from single-single alignment to multi-multi alignment. 

I think this interpretation locates the reason for the lack of robustness in neither capabilities nor alignment regime, which is why I wanted to raise it. It's a claim about the nature or structural properties of "human values"; or a hint that we are deeply confused about human values (e.g. that the term currently refers to an incoherent cluster or "unnatural" abstraction).

What you say about CEV might capture this fully, but whether it does, I think, is an empirical claim of sorts; a proposed solution to the more general diagnosis that I am trying to propose, namely that (the way we currently use the term) "human values" may itself not be robust to scaling down. 

Curious what different aspects the "duration of seclusion" is meant to be a proxy for? 

You definitely point at things like "when are they expected to produce intelligible output" and "what sorts of questions appear most relevant to them". Another dimension that came to mind - but I am not sure whether you mean to include it in the concept - is something like "how often are they allowed/able to peek directly at the world, relative to the length of periods during which they reason about things in ways that are removed from empirical data"?

PIBBSS Summer Research Fellowship -- Q&A event

  • What? Q&A session with the fellowship organizers about the program and application process. You can submit your questions here.
  • For whom? For everyone curious about the fellowship and for those uncertain whether they should apply.
  • When? Wednesday 12th January, 7 pm GMT
  • Where? On Google Meet, add to your calendar

I think it's a shame that these days for many people the primary connotation of the word "tribe" is connected to culture wars. In fact, our decision to use this term was in part motivated by wanting to re-appropriate the term to something less politically loaded.

As you can read in our post (see "What is a tribe?"), we mean something particular. Like any collective of human beings, it can in principle be subject to excessive in-group/out-group dynamics, but that's far from the only part of it, or the most interesting one.

Context:  (1) Motivations for fostering EA-relevant interdisciplinary research; (2) "domain scanning" and "epistemic translation" as a way of thinking about interdisciplinary research

[cross-posted to the EA forum in shortform]

List of fields/questions for interdisciplinary AI alignment research

The following list of fields and leading questions could be interesting for interdisciplinary AI alignment research. I started to compile this list to provide some anchorage for evaluating the value of interdisciplinary research for EA causes, specifically AI alignment.

Some comments on the list: 

  • Some of these domains are likely already very much on the radar of some people; others are more speculative.
  • In some cases I have a decent idea of concrete lines of questioning that might be interesting; in other cases all I do is gesture very broadly that “something here might be of interest”.
  • I don’t mean this list to be comprehensive or authoritative. On the contrary, this list is definitely skewed by domains I happened to have come across and found myself interested in.
  • While this list is specific to AI alignment (/safety/governance), I think the same rationale applies to other EA-relevant domains and I'd be excited for other people to compile similar lists relevant to their area of interest/expertise.


Very interested in hearing thoughts on the below!


Target domain: AI alignment/safety/governance 

  1. Evolutionary biology
    1. Evolutionary biology seems to have a lot of potentially interesting things to say about AI alignment. Just a few examples include:
      1. The relationship between environment, agent, evolutionary paths (which e.g. relates to the role of training environments)
      2. Niche construction as an angle on embedded agency
      3. The nature of intelligence
  2. Linguistics and Philosophy of language
    1. Lots of things that are relevant to understanding the nature and origin of (general) intelligence better.
    2. Sub-domains, such as semiotics could, for example, have relevant insights on topics like delegation and interpretability.
  3. Cognitive science and neuroscience
    1. Examples include Minsky’s The Society of Mind (“The power of intelligence stems from our vast diversity, not from any single, perfect principle”), Hawkins’s A Thousand Brains (the role of reference frames for general intelligence), Friston et al.’s Predictive Coding/Predictive Processing (in its most ambitious versions a near universal theory of all things cognition, perception, comprehension and agency), and many more
  4. Information theory
    1. Information theory is hardly news to the AI alignment idea space. However, there might still be value on the table from deeper dives or more out-of-the-ordinary applications of its insights. One example of this might be this paper on The Information Theory of Individuality.
  5. Cybernetics/Control Systems
    1. Cybernetics seems straightforwardly relevant to AI alignment. Personally, I’d love to have a piece of writing synthesising the most exciting intellectual developments under cybernetics done by someone with awareness of where the AI alignment field is at currently.
  6. Complex systems studies
    1. What does the study of complex systems have to say about robustness, interoperability, emergent alignment? It also offers insights into and methodology for approaching self-organization and collective intelligence which is interesting in particular in multi-multi scenarios.
  7. Heterodox schools of economic thinking
    1. Schools of thought that are trying to reimagine the economy/capitalism and (political) organization, e.g. through decentralization and self-organization, by working on antitrust, by trying to understand potentially radical implications of digitalization on the fabric of the economy, etc. Complexity economics, for example, can help us understand the out-of-equilibrium dynamics that shape much of our economy and lives.
  8. Political economy
    1. An interesting framework for thinking about AI alignment as a socio-technical challenge. Particularly relevant from a multi-multi perspective, or for thinking along the lines of cooperative AI. Pointer: Mapping the Political Economy of Reinforcement Learning Systems: The Case of Autonomous Vehicles
  9. Political theory
    1. The richness of the history of political thought is astonishing; the most obvious might be ideas related to social choice or principles of governance. (A dense yet high-quality overview is offered by this podcast series History Of Ideas.) The crux in making the depth of political thought available and relevant to AI alignment is formalization, which seems extremely undersupplied in current academia for very similar reasons as I’ve argued above.
  10. Management and organizational theory, Institutional economics and Institutional design
    1. Has things to say about e.g. interfaces (read this to get the gist of why I think interfaces are interesting for AI alignment); delegation (e.g. Organizations and Markets by Herbert Simon); and (potentially) the ontology of forms and (the relevant) agent boundaries (e.g. The secret to social forms has been in institutional economics all along?)
    2. Talks for example about desiderata for institutions like robustness (e.g. here), or about how to understand and deal with institutional path-dependencies (e.g. here).

Glad to hear it seemed helpful!

FWIW I'd be interested in reading you spell out in more detail what you think you learnt from it about simulacra levels 3+4.

Re "writing the bottom line first": I'm not sure. I think it might be, but at least this connection didn't feel salient, or like it would buy me anything in terms of understanding, when thinking about this so far. Again interested in reading more about where you think the connections are. 

To maybe say more about why (so far) it didn't seem clearly relevant to me: "Writing the bottom line first", to me, comes with a sense of actively not wanting, and taking steps to avoid, figuring out where the arguments/evidence lead you. Maps of maps feels slightly different in so far as the person really wants to find the correct solution but they are utterly confused about how to do that, or where to look. Similarly, "writing the bottom line first" suggests that you do have a concrete "bottom line" that you want to be true, whereas empty expectations don't have anything concrete to say about what you would want to be true - there is hardly any object-level substance there.
Most succinctly, "writing the bottom line first" seems closer to motivated reasoning, and maps of maps/empty expectation seem closer to (some fundamental sense of) confusion (about where to even look to figure out the truth/solution). (Which, having spelt this out just now, makes the connection to simulacra levels 3+4 more salient.)


Regarding "Staying grounded and stable in spite of the stakes": 
I think it might be helpful to unpack the virtue/skill(s) involved according to the different timescales at which emergencies unfold.

For example: 

1. At the timescale of minutes or hours, there is a virtue/skill of "staying level-headed in a situation of acute crisis". This is the sort of skill you want your emergency doctor or firefighter to have. (When you pointed to the military, I think you in part pointed to this scale but I assume not only.)

From talking to people who do or did jobs like this, a typical pattern seems to be that some types of people, when in situations like this, basically "freeze" and others basically move into a mode of "just functioning". There might be some margin for practice here (maybe you freeze the first time around and are able to snap out of the freeze the second time around, and after that, you can "remember" what it feels like to shift into functioning mode ever after) but, according to the "common wisdom" in these professions (as I understand it), mostly people seem to fall in one or the other category.

The sort of practice that I see being helpful here is a) overtraining on whatever skill you will need in the moment (e.g. imagine the emergency doctor) such that you can hand over most cognitive work to your autopilot once the emergency occurs; and b) training the skill of switching from freeze into high-functioning mode. I would expect "drill-type practices" to be the most apt to get at that, but as noted above I don't know how large the margin for improvement is. (A subtlety here: there seems to be a massive difference between "being the first person to switch into functioning mode", vs "switching into functioning mode after (literally or metaphorically speaking) someone screamed at your face to get moving". (Thinking of the military here.))

All that said, I don't feel particularly excited for people to start doing a bunch of drill practice or the like. I think there are possible extreme scenarios of "narrow hingy moments" that will involve this skill, but overall this doesn't seem to me to be the thing that is most needed/with highest EV.

(Probably also worth putting some sort of warning flag here: genuinely high-intensity situations can be harmful to people's psyche, so one should be very cautious about experimenting with things in this space.)

2. Next, there might be a related virtue/skill at the timescale of weeks and months. I think the pandemic, especially from ~March to May/June, is an excellent example of this, and was also an excellent learning opportunity for people involved in some time-sensitive COVID-19 problem. I definitely think I've gained some gears on what a genuine (i.e. highly stakey) 1-3 month sprint involves, and what challenges and risks are involved for you as an "agent" who is trying to also protect their agency/ability to think and act (though I think others have learnt and been stress-tested much more than I have).

Personally, my sense is that this is "harder" than the thing in 1., because you can't rely on your autopilot much, and this makes things feel more like an adaptive rather than technical problem (where the latter is a problem where the solution is basically clear, you just have to do it; and the former is a problem where most of the work needed is in figuring out the solution, not so much (necessarily) in executing it).

One difficulty is that this skill/virtue involves managing your energy, not only spending it well. Knowing yourself and how your energy and motivation structures work - and in particular how they work in extreme scenarios - seems very important. I can see how people who have meditated a lot have gained valuable skills here. I don't think it's the only way to get these skills, and I expect the thing that is paying off here is more "being able to look back on years of meditation practice and the ways this has rewired one's brain in some deep sense" rather than "benefits from having a routine to meditate" or something like this.

During the first couple of COVID-19 months, I was also surprised how "doing well at this" was more a question of collective rationality than I would have thought (by collective rationality I mean things like: ability to communicate effectively, ability to mobilise people/people with the right skills, ability to delegate work effectively). There is still a large individual component of "staying on top of it all/keeping the horizon in sight" such that you are able to make hard decisions (which you will be faced with en masse).

I think it could be really good to collect lessons learnt from the folks involved in some EA/rationalist-adjacent COVID-19 projects.

3. The scale of ~(a few) years seems quite similar in type to 2. The main thing that I'd want to add here is that the challenge of dealing with strong uncertainty while the stakes are massive can be very psychologically challenging. I do think meditation and related practices can be helpful in dealing with that in a way that is both grounded and not flinching from the truth.

I find myself wondering whether the military does anything to help soldiers prepare for the act of "going to war" where the possibility of death is extremely real. I imagine they must do things to support people in this process. It's not exactly the same but there certainly are parallels with what we want.

Re language as an example: parties involved in communication using language have comparable intelligence (and even there I would say someone just a bit smarter can cheat their way around you using language). 

Mhh yeah so I agree these are examples of ways in which language "fails". But I think they don't bother me too much?
I put them in the same category as "two agents with good faith sometimes miscommunicate - and still, language overall is pragmatically useful", or "works well enough". In other words, even though there is potential for exploitation, that potential is in fact meaningfully constrained. More importantly, I would argue that the constraint comes (in large part) from the way the language has been (co-)constructed.

a cascade of practically sufficient alignment mechanisms is one of my favorite ways to interpret Paul's IDA (Iterated Distillation-Amplification)

Yeah, great point!
