There is a disheartening irony to calling this series "Practical AI Safety" and having its longest post be about capabilities advancements which largely ignore safety.
The first part of this post consists in observing that ML applications proceed from metrics, and subsequently arguing that theoretical approaches have been unsuccessful in learning problems. This is true but irrelevant for safety, unless your proposal is to apply ML to safety problems, which reduces AI Safety to 'just find good metrics for safe behaviour'. This seems as far from a pragmatic understanding of what is needed in AI Safety as one can get.
In the process of dismissing theoretical approaches, you ask "Why do residual connections work? Why does fractal data augmentation help?" These are exactly the kind of questions which we need to be building theory for, not to improve performance, but for humans to understand what is happening well enough to identify potential risks orthogonal to the benchmarks which such techniques are improving against, or trust that such risks are not present.
You say, "If we want to have any hope of influencing the ML community broadly, we need to understand how it works (and sometimes doesn’t work) at a high level," and provide similar prefaces as motivation in other sections. I find these claims credible, assuming the "we" refers to AI Safety researchers, but considering the alleged pragmatism of this sequence, it's surprising to me that none of the claims are followed up with suggested action points. Given the information you have provided, how can we influence this community? By publishing ML papers at NeurIPS? And to what end are you hoping to influence them? AI Safety can attract attention, but attention alone doesn't translate into progress (or even into more person-hours).
Your disdain for theoretical approaches is transparent here (if it wasn't already from the name of this sequence). But your reasoning cuts both ways. You say, "Even if the current paradigm is flawed and a new paradigm is needed, this does not mean that [a researcher's] favorite paradigm will become that new paradigm. They cannot ignore or bargain with the paradigm that will actually work; they must align with it." I expect that 'metrics suffice' (a strawperson of your favoured paradigm) will not be the paradigm that will actually work, and it's disappointing that your sequence carries the message (to my reading) that technical ML researchers can make significant progress in alignment and safety without really changing what they're doing.
If I haven't found a way to extend my post-doc position (ending in August) by mid-July and by some miracle this job offer is still open, it could be the perfect job for me. Otherwise, I look forward to seeing the results.
A note on judging explanations
I should address a point that wasn't addressed in the post, and which may otherwise be a point of confusion going forward: the quality of an explanation can be high according to my criteria even if it isn't empirically correct. That is, there are some explanations of behaviour which may be falsifiable: if I am observing a robot, I could explain its behaviour in terms of an algorithm, and one way to "test" that explanation would be to discover the algorithm which the robot is in fact running. However, no matter the result of this test, the judged quality of the explanation is not affected. Indeed, there are two possible outcomes: either the actual algorithm provides a better explanation overall, or our explanatory algorithm could be a simpler algorithm with the same effects, and hence be a better explanation than the true one, since using this simpler algorithm is a more efficient way to predict the robot's behaviour than simulating the robot's actual algorithm.
This might seem counterintuitive at first, but it's really just Occam's razor in action. Functionally speaking, the explanations I'm talking about in this post aren't intended to be recovering the specific algorithm the robot is running (just as we don't need the specifics of its hardware or operating system); I am only concerned with accounting for the robot's behaviour.
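To make the point concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `actual_robot_policy` stands in for whatever the robot really runs, and `explanatory_policy` is a candidate explanation. Both predict the same behaviour on every observation, but the second is shorter and cheaper to evaluate, so by the criteria above it is the better explanation regardless of which one is "true".

```python
def actual_robot_policy(ticks):
    """Hypothetical 'true' algorithm: accumulate displacement tick by tick."""
    position = 0
    for speed in range(1, ticks + 1):
        position += speed  # the robot speeds up by one unit each tick
    return position


def explanatory_policy(ticks):
    """Candidate explanation: a closed form with the same observable behaviour."""
    return ticks * (ticks + 1) // 2


def agrees_on(observations, explanation, behaviour):
    """Check that an explanation predicts the observed behaviour exactly."""
    return all(explanation(x) == behaviour(x) for x in observations)


if __name__ == "__main__":
    # Accuracy is identical, so only complexity and cost of prediction differ.
    print(agrees_on(range(100), explanatory_policy, actual_robot_policy))
```

Discovering that the robot runs the loop rather than the closed form would not change how well the closed form accounts for what the robot does.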
Suppose your computer games, in addition to the long difficult path to your level's goal, also had little side-paths that you could use—directly in the game, as corridors—that would bypass all the enemies and take you straight to the goal, offering along the way all the items and experience that you could have gotten the hard way. And this corridor is always visible, out of the corner of your eye.
Even if you resolutely refused to take the easy path through the game, knowing that it would cheat you of the very experience that you paid money in order to buy—wouldn't that always-visible corridor, make the game that much less fun? Knowing, for every alien you shot, and every decision you made, that there was always an easier path?
This exact phenomenon happens in Deus Ex: Human Revolution, where you can get around almost every obstacle in the game by using the ventilation system. The frustration that results is apparent in this video essay/analysis: it undermines all of the otherwise well-designed systems in the game in spite of not actually interfering with the player's ability to engage with them.
I wonder if, alongside the "loss of rejected options" proposition, a reason that extra choices impact us is the mental bandwidth they take up. If the satisfaction we derive from a choice is (to a first-order approximation) proportional to our intellectual and emotional investment in the option we select, then having more options leaves less to invest as soon as the options go from being free to having any cost at all. As an economic analogy, a committee seeking to design a new product or building must choose between an initial set of designs. The more designs there are, the more resources must go into the selection procedure, and if the committee's budget is fixed, then this will remove resources that could have improved the product further down the line.
[0,1] is a commutative quantale when equipped with its usual multiplication. You can lift the monoidal product structure to sheaves on [0,1] (viewed as a frame) via Day convolution. So we recover a topos where the truth values are probabilities.
People who have attempted to build toposes with probabilities as truth values have also failed to notice this. Take Isham and Doering's paper, for example, (which I personally am quite averse to because they bullishly follow through on constructing toposes with certain properties which are barely justified). They don't even think about products of probabilities.
I think the monoidal topos on the unit interval merits some serious investigation.
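For reference, the structure I have in mind can be spelled out as follows. This is a sketch rather than a worked-out construction: the Day convolution formula is stated for presheaves on the poset [0,1], and I have not checked here what happens on passing to sheaves.

```latex
% Multiplication on [0,1] distributes over arbitrary suprema, with unit 1,
% which is what makes ([0,1], \bigvee, \cdot, 1) a commutative unital quantale:
\[
  a \cdot \bigvee_{i \in I} b_i \;=\; \bigvee_{i \in I} (a \cdot b_i)
  \qquad \text{for all } a, b_i \in [0,1].
\]
% The induced residuation (internal implication) is
\[
  a \multimap b \;=\; \bigvee \{\, c \in [0,1] : a \cdot c \le b \,\}
  \;=\;
  \begin{cases}
    1   & \text{if } a \le b, \\
    b/a & \text{if } a > b.
  \end{cases}
\]
% Day convolution of presheaves F, G on the poset [0,1]; the coend reduces
% to a colimit indexed by pairs (a, b) whose product lies above c:
\[
  (F \otimes G)(c) \;\cong\;
  \operatorname*{colim}_{\{(a,b) \,:\, c \le a \cdot b\}} F(a) \times G(b).
\]
```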
I see what you're getting at. For an arbitrary explanation, we need to take into account not only the complexity of the explanation itself, but also how difficult it is to compute a relevant prediction from that explanation; according to my criteria, the Standard Model (or any sufficiently detailed theory of physics that accurately explains phenomena within a conservative range of low-ish energy environments encountered on Earth) would count as a very good explanation for any behaviour for its complexity, but that's ignoring the fact that it would be impossible to actually compute those predictions.
While I made the claim that there is a clear dividing line between (accuracy and power) and (complexity), this strikes me as an issue straddling complexity and explanatory power, which muddies the water a little.
Since I've appealed to physics explanations in my post, I'm glad you've made me think about these points. Moving forward, though, I expect the classes of explanation under consideration to be so constrained as to make this issue insignificant. That is, I expect to be directly comparing explanations taking the form of goals to explanations taking the form of algorithms or similar; each of these has a clear interpretation in terms of its predictions and, while the former might be harder to compute, the difference in difficulty is going to be suitably uniform across the classes (after accounting for complexity of explanations), so that I feel justified in ignoring it until later.
Thanks for the ideas!
I like the idea about the size of the target states; there's bound to be some interesting measure theory that I can apply if I decide to formalize in that direction. In fact, measure theory might be able to clarify some of the subtleties I alluded to above regarding what happens when we refine the world model (for example, in a way that causes a single goal state to split into two or more).
There are hints in your last paragraph of associating competence with goal-directedness, which I think is an association to avoid. For example, when a zebra is swimming across a river as fast as it can, I would like the extent to which that behaviour is considered goal-directed to be independent of whether that zebra is the one that gets attacked by a crocodile.
The example you give has a pretty simple lattice of preferences, which lends itself to illustrations but which might create some misconceptions about how the subagent model should be formalized. For example, in your example you assume that the agents' preferences are orthogonal (one cares about pepperoni, the other about mushrooms, and each is indifferent to the opposite direction), the agents have equal weighting in the decision-making, the lattice is distributive... Compensating for these factors, there are many ways that a given 'weak utility' can be expressed in terms of subagents. I'm sure there are optimization questions that follow here, about the minimum number of subagents (dimensions) needed to embed a given weak-utility function (partially ordered set), and about when reasonable constraints such as orthogonality of subagents can be imposed. There are also composition questions: how does a committee of agents with subagents behave?
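To illustrate the non-uniqueness, here is a minimal sketch. It assumes (my reading, not something fixed by the post) that a committee of subagents weakly prefers state b to state a exactly when every subagent does, i.e. the Pareto order. The state space and all the utility functions are hypothetical.

```python
from itertools import product

# Hypothetical pizza states: (pepperoni amount, mushroom amount).
STATES = list(product(range(3), repeat=2))


def pepperoni(state):
    return state[0]


def mushroom(state):
    return state[1]


def total_toppings(state):
    # A redundant subagent: it never disagrees with the other two jointly.
    return state[0] + state[1]


def mostly_pepperoni(state):
    # Non-orthogonal subagent: cares about both toppings.
    return 2 * state[0] + state[1]


def mostly_mushroom(state):
    return state[0] + 2 * state[1]


def weakly_prefers(subagents, a, b):
    """True if every subagent values b at least as much as a."""
    return all(u(a) <= u(b) for u in subagents)


def comparable(subagents, a, b):
    return weakly_prefers(subagents, a, b) or weakly_prefers(subagents, b, a)


def induced_order(subagents, states=STATES):
    """All pairs (a, b) with a weakly below b in the induced preorder."""
    return {(a, b) for a in states for b in states
            if weakly_prefers(subagents, a, b)}


if __name__ == "__main__":
    orthogonal = [pepperoni, mushroom]
    skewed = [mostly_pepperoni, mostly_mushroom]
    # Two subagents or three: the same weak utility.
    print(induced_order(orthogonal) == induced_order(orthogonal + [total_toppings]))
    # Dropping orthogonality changes which trades are acceptable.
    print(comparable(orthogonal, (0, 1), (2, 0)), comparable(skewed, (0, 1), (2, 0)))
```

The redundant third subagent shows why a minimum number of subagents is a meaningful quantity, and the skewed pair shows that the orthogonality assumption does real work in fixing the order.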
It's really nice to see a critical take on analytic philosophy, thank you for this post. The call-out aspect was also appreciated: coming from mathematics, where people are often quite reckless about naming conventions to the detriment of pedagogical dimensions of the field, it is quite refreshing.
On the philosophy content, it seems to me that many of the vices of analytic philosophy are hard to shake, even for a critic such as yourself.
Consider the "Back to the text" section. There is some irony in your accusation of Chalmers basing his strategy on its name via its definition rather than the converse, yet you end that section by giving a definition-by-example of what engineering is and proceed with that definition. To me, this points to the tension between dismissing the idea that concepts should be well-defined notions in philosophical discourse, while relying on at least some precision of denotation in using names of concepts in discourse.
You also seem to lean on anthropological principles as analytic philosophy does. I agree that the only concepts which will appear in philosophical discourse will be those which are relevant to human experience, but that experience extends far beyond "human life" to anything of human interest (consider the language of physics and mathematics, which often doesn't have direct relation to our immediate experience), and this is a consequence of the fact that philosophy is a human endeavour rather than anything intrinsic to its content.
I'd like to take a different perspective on your Schmidhuber quote. Contrary to your interpretation, the fact that concepts are physically encoded in neural structures supports the Platonic idea that these concepts have an independent existence (albeit a far more mundane one than Plato might have liked). The empirical philosophy approach might be construed as investigating the nature of concepts statistically. However, there is a category error happening here: in pursuing this investigation, an empirical philosopher is conflating the value of the global concept with their own "partial" perspective concept.
I would argue that, whether one is convinced they exist or not, no one is invested in communal concepts, which are the kind of fragmented, polysemous entity which you describe, for their own sake. Individuals are mostly invested in their own conceptions of concepts, and take an interest in communal concepts only insofar as they are interested in being included in the community in which those concepts reside. In short, relativism is an alternative way to resolve concepts: we can proceed not by rejecting the idea that concepts can have clear definitions (which serve to ground discourse in place of the more nebulous intuitions which motivate them), but rather by recognizing that any such definitions must come with a limited scope. I also personally reject the idea that a definition should be expected to conform to all of the various "intuitions" which are appealed to in classical philosophy for various reasons, but especially because there seems no a priori reason that any human should have infallible (or even rational) intuitions about concepts.
I might even go so far as to say that recognizing relativism incorporates your divide and conquer approach to resolving disagreement: the gardeners and landscape artists can avoid confusion when they discuss the concept of soil by recognizing their differing associations with the concept and hence specifying the details relevant to the union of their interests. But each can discard the extraneous details in discussion with their own community, just as physicists will go back to talking about "sound" in its narrowed sense when talking with other physicists. These narrowings only seem problematic if one expects the scope of all discourse to be universal.
In section 4.6, you described an "unnatural" reward function splintering, and went on to advocate for more natural ones. I would agree with your argument as a general principle, but on the other hand I can think of situations where an exceptional case should be accounted for. Suppose that the manager of the rube-blegg factory keeps a single rube and a single blegg in a display case on the factory floor to present to touring visitors. A rube classifier which physically sorts rubes and bleggs should be able to recognize that these displayed examples are not to be sorted with the others, even though this requires making an unnatural extension of its internal reward function.
I think your examples in Section 6 of suitably deferring to human values upon model splintering could resolve this, but to me it highlights that a naive approach to model splintering could result in problems if the AI is not keeping track of enough features of the world to identify when an automatic "natural" extension of its model is inappropriate.