All of Oscar_Cunningham's Comments + Replies

Yup, you want a bi-interpretation:

Two models or theories are mutually interpretable, when merely each is interpreted in the other, whereas bi-interpretation requires that the interpretations are invertible in a sense after iteration, so that if one should interpret one model or theory in the other and then re-interpret the first theory inside that, then the resulting model should be definably isomorphic to the original universe

Bi-interpretation in weak set theories

They will in fact stop you from taking fancy laser measuring kit to the top of the Shard:

I don't think that's the only reason - if I value something linearly, I still don't want to play a game that almost certainly bankrupts me.

I still think that's because you intuitively know that bankruptcy is worse-than-linearly bad for you. If your utility function were truly linear then it's true by definition that you would trade an arbitrary chance of going bankrupt for a tiny chance of a sufficiently large reward.

I mean, that's not obvious - the Kelly criterion gives you, in the example with the game, E(money) = $240, compared to $246.61 with the

... (read more)
I've been thinking about it, and I'm not sure if this is the case in the sense you mean it - expected money maximization doesn't reflect human values at all, white Kelly criterion mostly does, so if we make our assumptions more realistic, it should move us away from expected money maximization and towards the Kelly criterion, as opposed to moving us the other way.

It bankrupts you with probability 1 - 0.6^300, but in the other 0.6^300 of cases you get a sweet sweet $25 × 2^300. This nets you an expected $1.42 × 10^25.

Whereas Kelly betting only has an expected value of $25 × (0.6×1.2 + 0.4×0.8)^300 = $3220637.15.

Obviously humans don't have linear utility functions, but my point is that the Kelly criterion still isn't the right answer when you make the assumptions more realistic. You actually have to do the calculation with the actual utility function.

So, by optimal, you mean "almost certainly bankrupt you." Then yes. My definition of optimal is very different. I don't think that's the only reason - if I value something linearly, I still don't want to play a game that almost certainly bankrupts me. I mean, that's not obvious - the Kelly criterion gives you, in the example with the game, E(money) = $240, compared to $246.61 with the optimal strategy. That's really close.

The answer is that you bet approximately Kelly.

No, it isn't. Gwern never says that anywhere, and it's not true. This is a good example of what I'm saying.

For clarity the game is this. You start with $25 and you can bet any multiple of $0.01 up to the amount you have. A coin is flipped with a 60/40 bias in your favour. If you win you double the amount you bet, otherwise you lose it. There is a cap of $250, so after each bet you lose any money over this amount (so in fact you should never make a bet that could take you over). This continues for 300 rounds.

Bo... (read more)

Hmm. I think we might be misunderstanding each other here. When I say Gwern's post leads to "approximately Kelly", I'm not trying to say it's exactly Kelly. I'm not even trying to say that it converges to Kelly. I'm trying to say that it's much closer to Kelly than it is to myopic expectation maximization. Similarly, I'm not trying to say that Kelly maximizes expected value. I am trying to say that expected value doesn't summarize wipeout risk in a way that is intuitive for humans, and that those who expect myopic expected values to persist across a time series of games in situations like this will be very surprised. I do think that people making myopic decisions in situation's like Bob's should in general bet Kelly instead of expected value maximizing. I think an understanding of what ergodicity is, and whether a statistic is ergodic, helps to explain why. Given this, I also think that it makes sense to ask whether you should be looking for bets that are more ergodic in their ensemble average (like index funds rather than poker). In general, I find expectation maximization unsatisfying because I don't think it deals well with wipeout risk. Reading Ole Peters helped me understand why people were so excited about Kelly, and reading this article by Gwern helped me understand that I had been interpreting expectation maximization in a very limited way in the first place. In the limit of infinite bets like Bob's with no cap, myopic expectation maximization at each step means that most runs will go bankrupt. I don't find the extremely high returns in the infinitesimally probable regions to make up for that. I'd like a principled way of expressing that which doesn't rely on having a specific type of utility function, and I think Peters' ergodicity economics gets most but not all the way there. Other than that, I don't disagree with anything you've said.
This isn't right unless I'm missing something - Kelly provides the fastest growth, while betting everything on every round is almost certain to bankrupt you.

If Bob wants to maximise his money at the end, then he really should bet it all every round. I don't see why you would want to use Kelly rather than maximising expected utility. Not maximising expected utility means that you expect to get less utility.

Since writing the original post, I've found Gwern's post about a solution to something almost identical to Bob's problem. In this post, he creates a decision tree for every possible move starting from the first one, determining final value at the leaf nodes. He then uses the Bellman equation and traditional expected value to back out what you should do in the earliest moves. The answer is that you bet approximately Kelly. Gwern's takeaway here is (I think) that expected value always works, but you have to make sure you're solving the right problem. Using expected value naively at each step, discounting the temporal nature of the problem, leads to ruin. I think many of the more philosophical points in my original post still stand, as doing backwards induction even on this toy problem is pretty difficult (it took his software 16 hours to find the solution). Collapsing a time series expected value problem to a one-shot Kelly problem saves a lot of effort, but to do that you need an ergodic statistic. Even once you've done that, you should still make sure the game is worth playing before you actually start betting.
This isn't actually right though - the concept of maximizing utility doesn't quite overlap with expecting to have more or less utility at the end. There are many examples where maximizing your expected utility means expecting to go broke, and not maximizing it means expecting to end up with more money. (Even though, in this particular one-turn example, Bob should, in fact, expect to end up with more money if he bets everything.)
The problem with maximising expected utility is that Bob will sit their playing 1 more round, then another 1 more round again and again until he eventually looses everything. Each step maximised the expected utility, but the policy overall guarantees zero utility with certainty, assuming Bob never runs out of time. But, even as utility-maximising-Bob is saved from self-destruction by the clock, he shall think to himself "dam it! Out of time. That is really annoying, I want to keep doing this bet". At least to me Kelly betting fits in the same kind of space as the Newcomb paradox and (possibly) the prisoners dilemma. They all demonstrate that the optimal policy is not necessarily given by a sequence of optimal actions at every step.
Well put. I agree that we should try to maximize the value that we expect to have after playing the game. My claim here is that just because a statistic is named "expected value" doesn't mean it's accurately representing what we expect to happen in all types of situations. In Alice's game, which is ergodic, traditional ensemble-averaging based expected value is highly accurate. The more tickets Alice buys, the more her actual value converges to the expected value. In Bob's game, which is non-ergodic, ensemble-based expected value is a poor statistic. It doesn't actually predict the value that he would have. There's no convergence between Bob's value and "expected value", so it seems strange to say that Bob "expects to get" the result of the ensemble average here. You can certainly calculate Bob's ensemble average, and it will have a higher result than the temporal average (as I state in my post). My claim is that this doesn't help you, because it's not representative of Bob's game at all. In those situations, maximizing temporal average is the best you can do in reality, and the Kelly criterion maximizes that. Trying to maximize ensemble-based expected value here will wipe you out.

Can you be more precise about the exact situation Bob is in? How many rounds will he get to play? Is he trying to maximise money, or trying to beat Alice? I doubt the Kelly criterion will actually be his optimal strategy.

I wrote this with the assumption that Bob would care about maximizing his money at the end, and that there would be a high but not infinite number of rounds. On my view, your questions mostly don't change the analysis much. The only difference I can see is that if he literally only cares about beating Alice, he should go all in. In that case, having $1 less than Alice is equivalent to having $0. That's not really how people use money though, and seems pretty artificial. How are you expecting these answers to change things?

I tend to view the golden ratio as the least irrational irrational number. It fills in the next gap after all the rational numbers. In the same way, 1/2 is the noninteger which shares the most algebraic properties with the integers, even though it's furthest from them in a metric sense.

Nice idea! We can show directly that each term provides information about the next.

The density function of the distribution of the fractional part in the continued fractional algorithm converges to 1/[(1+x) ln(2)] (it seems this is also called the Gauss-Kuzmin distribution, since the two are so closely associated). So we can directly calculate the probability of getting a coefficient of n by integrating this from 1/(n+1) to 1/n, which gives -lg(1-1/(n+1)^2) as you say above. But we can also calculate the probability of getting an n followed by an m, by int... (read more)

Huh, an extra reason why the golden ratio is the "most irrational"/most unusual irrational number.
5Adam Scherlis1y
Thanks! That's surprisingly straightforward.

I've only been on Mastodon a bit longer than the current Twitter immigrants, but as far as I know there's no norm against linking. But the server admins are all a bit stressed by the increased load. So I can understand why they'd be annoyed by any link that brought new users. I've been holding off on inviting new users to the instance I'm on, because the server is only just coping as it is.

It sounds like that by other people linking without consent they received unwanted attention: "I realised that some people had cross-posted my Mastodon post into Twitter. ... I struggled to understand what I was feeling, or the word to describe it. I finally realised on Monday that the word I was looking for was "traumatic". In October I would have interacted regularly with perhaps a dozen people a week on Mastodon, across about 4 or 5 different servers. Suddenly having hundreds of people asking (or not) to join those conversations without having acclimatised themselves to the social norms felt like a violation, an assault.* I guess what I'm trying to understand is whether Mastodon has some norm of "don't bring attention to people's writing without their consent"?

Apologies for asking an object level question, but I probably have Covid and I'm in the UK which is about to experience a nasty heatwave. Do we have a Covid survival guide somewhere?

(EDIT: I lived lol)

Is there a way to alter the structure of a futarchy to make it follow a decision theory other than EDT?

Or is that still too closely tied to the explore-exploit paradigm?

Right. The setup for my problem is the same as the 'bernoulli bandit', but I only care about the information and not the reward. All I see on that page is about exploration-exploitation.

What's the term for statistical problems that are like exploration-exploitation, but without the exploitation? I tried searching for 'exploration' but that wasn't it.

In particular, suppose I have a bunch of machines which each succeed or fail independently with a probability that is fixed separately for each machine. And suppose I can pick machines to sample to see if they succeed or fail. How do I sample them if I want to become 99% certain that I've found the best machine, while using the fewest number of samples?

The difference with exploration-exploitat... (read more)

From Algorithms to Live By, I vaguely recall the multi-armed bandit problem. Maybe that's what you're looking for? Or is that still too closely tied to the explore-exploit paradigm?

The problem is that this measures their amount of knowledge about the questions as well as their calibration.

My model would be as follows. For a fixed source of questions, each person has a distribution describing how much they know about the questions. It describes how likely it is that a given question is one they should say p on. Each person also has a calibration function f, such that when they should say p they instead say f(p). Then by assigning priors over the spaces of these distributions and calibration functions, and applying Bayes' rule we get a... (read more)

Agreed that the proposal combines knowledge with calibration, but your procedure doesn't actually seem implementable.

I was expecting a post about tetraethyllead.

You just have to carefully do the algebra to get an inductive argument. The fact that the last digit is 5 is used directly.

Suppose n is a number that ends in 5, and such that the last N digits stay the same when you square it. We want to prove that the last N+1 digits stay the same when you square n^2.

We can write n = m*10^N + p, where p has N digits, and so n^2 = m^2*10^(2N) + 2mp*10^N + p^2. Note that since 2p ends in 0, the term 2mp*10^N actually divides by 10^(N+1). Then since the two larger terms divide by 10^N+1, n^2 agrees with p^2 on its last N+1 d... (read more)


If we start with 5 and start squaring we get 5, 25, 625, 390625, 152587890625.... Note how some of the end digits are staying the same each time. If we continue this process we get a number ...8212890625 which is a solution of x^2 = x. We get another solution by subtracting this from 1 to get ...1888119476.

I may be missing something, but it is not obvious to me why the number of digits at the end that stay the same is necessarily always increasing. I mean, the last N digits of x^2 depend on the last N digits of x. But it requires an explanation why the last N+1 digits of x^2 would be determined by the last N digits of x. I can visualize a possibility that perhaps at certain place a digit in 5^(2^k) starts changing periodically. Why would that not be the case? Also, is there something special about 5, or is the same true for all numbers?

if not, then we'll essentially need another way to define determinant for projective modules because that's equivalent to defining an alternating map?

There's a lot of cases in mathematics where two notions can be stated in terms of each other, but it doesn't tell us which order to define things in.

The only other thought I have is that I have to use the fact that  is projective and finitely generated. This is equivalent to  being dualisable. So the definition is likely to use  somewhere.

I'm curious about this. I can see a reasonable way to define  in terms of sheaves of modules over : Over each connected component,  has some constant dimension , so we just let  be  over that component.

If we call this construction  then the construction I'm thinking of is . Note that  is locally -dimensional, so my construction is locally isomorphic to yours but globally twisted. It depends on  via more than just its... (read more)

Oh right, I was picturing W being free on connected components when I suggested that. Silly me. F is alternating if F(f∘g)=det(g)F(f), right? So if we're willing to accept kludgy definitions of determinant in the process of defining ΛW(V), then we're all set, and if not, then we'll essentially need another way to define determinant for projective modules because that's equivalent to defining an alternating map?

I agree that 'credence' and 'frequency' are different things. But round here the word 'probability' does refer to credence rather than frequency. This isn't a mistake; it's just the way we're using words.

Okay, but I've also seen rationalists use point estimates for probability in a way that led them to mess up Bayes, and such that it would be clear if they recognized the probability was uncertain (e.g., I saw this a few times related to covid predictions). I feel like it's weird to use "frequency" for something that will only happen (or not happen) once, like whether the first AGI will lead to human extinction, though ultimately I don't really care what word people are using for which concept.

Another thing to say is if  then


My intuition for  is that it tells you how an infinitesimal change accumulates over finite time (think compound interest). So the above expression is equivalent to . Thus we should think 'If I perturb the identity matrix, then the amount by which the unit cube grows is proportional to the extent to which each vector is being stretched in the direction it was already pointing'.

Hmm, this seems wrong but fixable. Namely, exp(A) is close to (I+A/n)^n, so raising both sides of det(exp(A))=exp(tr(A)) to the power of 1/n gives something like what we want. Still a bit too algebraic though, I wonder if we can do better.

Thank you for that intuition into the trace! That also helps make sense of .

Interesting, can you give a simple geometric explanation?

This is close to one thing I've been thinking about myself. The determinant is well defined for endomorphisms on finitely-generated projective modules over any ring. But the 'top exterior power' definition doesn't work there because such things do not have a dimension. There are two ways I've seen for nevertheless defining the determinant.

  • View the module as a sheaf of modules over the spectrum of the ring. Then the dimension is constant on each connected component, so you can take the top exterior power on each and then glue them back together.
  • Use the fact
... (read more)
I'm curious about this. I can see a reasonable way to define ΛW(V) in terms of sheaves of modules over Spec(R): Over each connected component, W has some constant dimension n, so we just let ΛW(V) be Λn(V) over that component. But it sounds like you might not like this definition, and I'd be interested to know if you had a better way of defining ΛW(V) (which will probably end up being equivalent to this). [Edit: Perhaps something in terms of generators and relations, with the generators being linear maps W→V?]

I think the determinant is more mathematically fundamental than the concept of volume. It just seems the other way around because we use volumes in every day life.

3Ege Erdil2y
I think the good abstract way to think about the determinant is in terms of induced maps on the top exterior power. If you have an n dimensional vector space V and an endomorphism L:V→V, this induces a map ∧nV→∧nV, and since ∧nV is always one-dimensional this map must be of the form v→kv for some scalar k in the ground field. It's this k that is the determinant of L. This is indeed more fundamental than the concept of volume. We can interpret exterior powers as corresponding to volume if we're working over a local field, for example, but actually the concept of exterior power generalizes far beyond this special case. This is why the determinant still preserves its nice properties even if we work over an arbitrary commtuative ring; since such rings still have exterior powers behaving in the usual way. I didn't present it like this in this post because it's actually not too easy to introduce the concept of "exterior power" without the post becoming too abstract.

I don't dispute what you say. I just suggest that the confusing term "in the worst case" be replaced by the more accurate phrase "supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated 'random'".

In this case P is the cumulative distribution function, so it has to approach 1 at infinity, rather than the area under the curve being 1. An example would be 1/(1+exp(-x)).

3Robert Kennedy2y
Actually, for any given P which works, P'(x)=P(x)/10 is also a valid algorithm.

A simple way to do this is for ROB to output the pair of integers {n, n+1} with probability K((K-1)/(K+1))^|n|, where K is some large number. Then even if you know ROB's strategy the best probability you have of winning is 1/2 + 1/(2K).

If you sample an event N times the variance in your estimate of its probability is about 1/N. So if we pick K >> √N then our probability of success will be statistically indistinguishable from 1/2.

The only difficulty is implementing code to sample from a geometric distribution with a parameter so close to 1.

The original problem doesn't say that ROB has access to your algorithm, or that ROB wants you to lose.

In such problems, it is usually assumed that your solution have to work (in this case work = better than 50% accuracy) always, even in the worst case, when all unknowns are against you.
From near the end of the post you linked: In the case of Robert's problem, since ROB is specified to have access to your algorithm, it effectively "moves second", which does indeed place it in a position to "use its intelligence on you". So it should be unsurprising that a mixed strategy can beat out a pure strategy. Indeed, this is why mixed strategies are sometimes optimal in game-theoretic situations: when adversaries are involved, being unpredictable has its advantages. Of course, this does presume that you have access to a source of randomness that even ROB cannot predict. If you do not have access to such a source, then you are in fact screwed. But then the problem becomes inherently unfair, and thus of less interest. (From a certain perspective, even this is compatible with Eliezer's main message: for an adversary to be looking over your shoulder, "moving second" relative to you, and anti-optimizing your objective function, is indeed a special case of things being "worse than random". The problem is that producing a "superior derandomized algorithm" in such a case requires inverting the "move order" of yourself and your adversary, which is not possible in many scenarios.)

Right, I should have chosen a more Bayesian way to say it, like 'suceeds with probability greater than 1/2’.

 The intended answer for this problem is the Frequentist Heresy in which ROB's decisions are treated as nonrandom even though they are unknown, while the output of our own RNG is treated as random, even though we know exactly what it is, because it was the output of some 'random process'.

Instead, use the Bayesian method. Let P({a,b}) be your prior for ROB's choice of numbers. Let x be the number TABI gives you. Compute P({a,b}|x) using Bayes' Theorem. From this you can calculate P(x=max(a,b)|x). Say that you have the highest number if this is over 1/2

... (read more)

Some related discussions: 1. 2. 3.

My own thoughts.

  • Patterns in GoL are generally not robust. Typically changing anything will cause the whole pattern to disintegrate in a catastrophic explosion and revert to the usual 'ash' of randomly placed small still lifes and oscillators along with some escaping gliders.

  • The pattern Eater 2 can eat gliders along 4 adjacent lanes.

... (read more)

Yes, I'm a big fan of the Entropic Uncertainty Principle. One thing to note about it is that the definition of entropy only uses the measure space structure of the reals, whereas the definition of variance also uses the metric on the reals as well. So Heisenberg's principle uses more structure to say less stuff. And it's not like the extra structure is merely redundant either. You can say useful stuff using the metric structure, like Hardy's Uncertainty Principle. So Heisenberg's version is taking useful information and then just throwing it away.

I'd almos... (read more)

A good source for the technology available in the Game of Life is the draft of Nathaniel Johnston and Dave Greene's new book "Conway’s Game of Life: Mathematics and Construction".

3Ramana Kumar2y
Thanks! I'd had a bit of a look through that book before and agree it's a great resource. One thing I wasn't able to easily find is examples of robust patterns. Does anyone know if there's been much investigation of robustness in the Life community? The focus I've seen seems to be more on particular constructions (used in its entirety as the initial state for a computation), rather than on how patterns fare when placed in various ranges of different contexts.

The if the probabilities of catching COVID on two occasions are x and y, the the probability of catching it at least once is 1 - (1 - x)(1 - y) which equals x + y - xy. So if x and y are large enough for xy to be significant, then splitting is better because even though catching it the second time will increase your viral load, it's not going to make it twice as bad as it already was.

The link still works for me. Perhaps you must first become a member of that discord? Invite link: (valid for 7 days)

Thanks. I also found an invite link in a recent reddit post about this discussion (was that by you?).

The weird thing is that there are two metrics involved: information can propagate through a nonempty universe at 1 cell per generation in the sense of the l_infinity metric, but it can only propagate into empty space at 1/2 a cell per generation in the sense of the l_1 metric.

You're probably right, but I can think of the following points.

Its rule is more complicated than Life's, so its worse as an example of emergent complexity from simple rules (which was Conway's original motivation).

It's also a harder location to demonstrate self replication. Any self replicator in Critters would have to be fed with some food source.

Yeah, although probably you'd want to include a 'buffer' at the edge of the region to protect the entity from gliders thrown out from the surroundings. A 1,000,000 cell thick border filled randomly with blocks at 0.1% density would do the job.

That seems great. Is there any reason people talk a lot about Life instead of Critters? (Seems like Critters also supports universal computers and many other kinds of machines. Are there any respects in which it is known to be less rich than Life?)

This is very much a heuristic, but good enough in this case.

Suppose we want to know how many times we expect to see a pattern with n cells in a random field of area A. Ignoring edge effects, there are A different offsets at which the pattern could appear. Each of these has a 1/2^n chance of being the pattern. So we expect at least one copy of the pattern if n < log_2(A).

In this case the area is (10^60)^2, so we expect patterns of size up to 398.631. In other words, we expect the ash to contain any pattern you can fit in a 20 by 20 box.

8Alex Flint3y
So just to connect this back to your original point: if we knew that it were possible to construct some kind of intelligent entity in a region with area of, say, 1,000,000,000 cells, then if our overall grid had 21,000,000,000 total cells and we initialized it at random, then we would expect an intelligent entity to pop up by chance at least once in the whole grid.

The glider moves at c/4 diagonally, while the c/2 ships move horizontally. A c/2 ship moving right and then down will reach its destination at the same time the c/4 glider does. In fact, gliders travel at the empty space speed limit.

Huh. Something about the way speed is calculated feels unintuitive to me, then.

Most glider guns in random ash will immediately be destroyed by the chaos they cause. Those that don't will eventually reach an eater which will neutralise them. But yes, such things could pose a nasty surprise for any AI trying to clean up the ash. When it removes the eater it will suddenly have a glider stream coming towards it! But this doesn't prove it's impossible to clear up the ash.

Making a 'ship sensor' is tricky. If it collides with something unexpected it will create more chaos that you'll have to clear up.

This sounds like you're treating the area as empty space, whereas the OP specifies that it's filled randomly outside the area where our AI starts.

OP said I can initialize a large chunk as I like (which I initialize to be empty aside from my constructors to avoid interfering with placing the pixels), and then the rest might be randomly or arbitrarily initialized, which is why I brought up the wall of still-life eaters to seal yourself off from anything that might then disrupt it. If his specific values don't give me enough space, but larger values do, then that's an answer to the general question as nothing hinges on the specific values.

My understanding was that we just want to succeed with high probability. The vast majority of configurations will not contain enemy AIs.

3Alex Flint3y
Yeah success with high probability was how I thought the question would need to be amended to deal with the case of multiple AIs. I mentioned this in the appendix but should have put it in the main text. Will add a note.
Load More