AlexMennen

# Comments

An overview of 11 proposals for building safe advanced AI

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

Intuitive Lagrangian Mechanics

I'm confused about the motivation for in terms of time dilation in general relativity. I was under the impression that general relativity doesn't even have a notion of gravitational potential, so I'm not sure what this would mean. And in Newtonian physics, potential energy is only defined up to an added constant. For to represent any sort of ratio (including proper time/coordinate time), V would have to be well-defined, not just up to an arbitrary added constant.

I also had trouble figuring out the relationship between the Euler-Lagrange equation and extremizing S. The Euler-Lagrange equation looks to me like just a kind of funny way of stating Newton's second law of motion, and I don't see why it should be equivalent to extremizing action. Perhaps this would be obvious if I knew some calculus of variations?

Relaxed adversarial training for inner alignment

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results the end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

An overview of 11 proposals for building safe advanced AI

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it's just affecting the efficiency with which putting resources towards training leads to good performance. Separating competitiveness into training and performance competitiveness would make sense if there's a fixed amount of training that must be done to achieve any reasonable performance at all, but past that, more training is not effective at producing better performance. My impression is that this isn't usually what happens.

Scott Garrabrant's problem on recovering Brouwer as a corollary of Lawvere
Let α be the least countable ordinal such that there is no polynomial-time computable recursive well-ordering of length α.

, which makes the claim you made about it vacuous.

Proof: Let be any computable well-ordering of . Let be the number of steps it takes to compute whether or not . Let (notice I'm using the standard ordering on , so this is the maximum of a finite set, and is thus well-defined). is computable in time. Let be a bijective pairing function such that both the pairing function and its inverse are computable in polynomial time. Now let be the well-ordering of given by , if is not for any , and if neither nor is of the form for any . Then is computable in polynomial time, and the order type of is plus the order type of , which is just the same as the order type of if that order type is at least .

The fact that you said you think is makes me suspect you were thinking of the least countable ordinal such that there is no recursive well-ordering of length that can be proven to be a recursive well-ordering in a natural theory of arithmetic such that, for every computable function, there's a program computing that function that the given theory can prove is total iff there's a program computing that function in polynomial time.

An Orthodox Case Against Utility Functions
This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities.

This might be true if we were idealized agents who do Bayesian updating perfectly without any computational limitations, but as it is, it seems to me that the assumption that there is a fixed prior is unreasonably demanding. People sometimes update probabilities based purely on further thought, rather than empirical evidence, and a framework in which there is a fixed prior which gets conditioned on events, and banishes discussion of any other probability distributions, would seem to have some trouble handling this.

Doesn't pointless topology allow for some distinctions which aren't meaningful in pointful topology, though?

Sure, for instance, there are many distinct locales that have no points (only one of which is the empty locale), whereas there is only one ordinary topological space with no points.

Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB.

Assuming you're referring to "So a similar thing here would be to treat a utility function as a function from some lattice of subsets of (the Borel subsets, for instance) to the lattice of events", no. In JB, the set of events is the domain of the utility function, and in what I said, it is the codomain.

An Orthodox Case Against Utility Functions
In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function or the standard normal distribution.

It seems that you're going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

An Orthodox Case Against Utility Functions

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

An Orthodox Case Against Utility Functions

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in peoples' heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions in my mind, since I typically use qualitative descriptions of probability rather than numbers in [0,1], and when I do use numbers, they are very rough, but this does seem like a sort of fuzzy, computationally limited version of a probability distribution. Similarly, my estimations of how good various outcomes are are often qualitative, rather than numerical, and again this seems like a fuzzy, computationally limited version of utility function. In order to determine the utility of the event "I take action A", I need to consider how good and how likely various consequences are, and take the expectation of the 'how good' with respect to the 'how likely'. The Jeffrey-Bolker framework seems to be asking me to pretend none of that ever happened.

An Orthodox Case Against Utility Functions
Say I have a computer that will simulate an arbitrary Turing machine T, and will award me one utilon when that machine halts, and do nothing for me until that happens. With some clever cryptocurrency scheme, this is a scenario I could actually build today.

No, you can't do that today. You could produce a contraption that will deposit 1 BTC into a certain bitcoin wallet if and when some computer program halts, but this won't do the wallet's owner much good if they die before the program halts. If you reflect on what it means to award someone a utilon, rather than a bitcoin, I maintain that it isn't obvious that this is even possible in theory.

Why in the world would one expect a utility function over an uncountable domain to be computable?

There is a notion of computability in the continuous setting.

As far as I can see, the motivation for requiring a utility function to be computable is that this would make optimization for said utility function to be a great deal easier.

This seems like a strawman to me. A better motivation would be that agents that actually exist are computable, and a utility function is determined by judgements rendered by the agent, which is incapable of thinking uncomputable thoughts.

Load More