Automatic programming, an example

1st Feb 2012

15wmorgan

39jimrandomh

8RolfAndreassen

3Thomas

5FeepingCreature

2Thomas

1wmorgan

0Thomas

0Thomas

1jimrandomh

1Thomas

2Douglas_Knight

8jmmcd

5wmorgan

3shminux

4siodine

0loveandthecoexistenc

1Anubhav

0pedanterrific

4Incorrect

4Thomas

2shminux

15RolfAndreassen

1jmmcd

2Houshalter

0Thomas

1Dmytry

0Luke_A_Somers

-12911truther

8satt

-5911truther

3jmmcd

5A1987dM

4Viliam_Bur

-7911truther

7Viliam_Bur

New Comment

36 comments, sorted by Click to highlight new comments since: Today at 7:28 AM

The generated code is bizarre. I refactored it as well as I could, and it still doesn't make much sense:

```
aphelion = (aphelion + perihelion) >> 10;
aphelion = aphelion - (aphelion / 12);
guess = ( ( (aphelion | 12) * (int)sqrt(aphelion) ) ^ 12 ) / 12;
```

"To get the orbit time in days from the aphelion and perihelion in Kkm, first sum them and divide by 1024. Then from that, subtract one twelfth. Then, to the value, perform a bitwise OR with 0x0C, multiply by the square root, and bit-XOR 0x0C again. Finally, divide by 12, and that will give you the number of days."

The three problems with the code are that the variable names are all lies, there's a bunch of redundant rescaling which isn't consolidated because it's done in integer math when it should be floating point, and there are a couple bits of overfitting (bitwise operators) that don't belong. If you convert to SSA and wipe out the misleading names, you get:

```
a1 = perihelion+aphelion Real
a2 = a1+a1 Rescaling
a3 = a2+a2 Rescaling
g1 = 12 Rescaling
a4 = a3>>g1 Rescaling
t1 = a4/g1 Rescaling
a5 = a4-t1 Rescaling
d1 = sqrt(a5) Real
a6 = g1|a5 Overfit
a7 = a6*d1 Real
a8 = g1^a7 Overfit
guess = a8/g1 Rescaling
```

If you replace the overfitting with pass-throughs (a6=a5, a8=a7), then pretend it's floating point so that you can consolidate all the rescaling into a single multiplicative constant, you get

```
guess = k * (perihelion+aphelion)*sqrt(perihelion+aphelion)
```

Which is Kepler's third law.

This reminds me of the discussion from last week of the code that a self-modifying AI might produce; I said then that I thought people were not thinking Weirdly enough. This is indeed an example of Weirdness. Obviously no human would come up with such a thing. Yet it works, more or less; although the hardcoded numbers make me suspect that its range of applicability may not be great. Testing it on some numbers typical of, say, solar orbits around the galactic core, might produce some surprising results.

It would also be interesting to compare this for speed, as well as accuracy, with a Kepler's-law implementation.

No. The posted code has a bit shift right for 12 places. The already optimized code by wmorgan has a bit shift for only 10 bits.

The metacommand *$RESCOM if while val_operation inc_dec* caused this. Having two constants (10 and 12) would be undesirable be cause of this "val_operation" and therefore only the constant 12 was used.

This is the generated code segment:

```
aphelion=aphelion+aphelion;
aphelion=aphelion+aphelion;
guess=12;
aphelion=aphelion>>guess;
```

Those four lines together amount to a shift 10 bits to the right, i.e., division by 1024.

I think you understand what's going in the code. The point of my refactoring was to make something that was human-readable: something that I could describe in English. And the English for those four lines of code is "divide by 1024." That's what those four lines do.

We can modify the above code to:

$DECLAREINT aphelion perihelion dif guess temp $RINVAR aphelion(1000,100000) perihelion(1000,100000) $RETVAR guess if (aphelion>guess; temp=aphelion/guess; aphelion=aphelion-temp; dif=sqrt(aphelion); //aphelion=guess|aphelion; aphelion=aphelion*dif; //aphelion=guess^aphelion; guess=aphelion/guess; $EES As you can see, there is no $RESCOM metacommand and the two "overfit" OR and XOR lines has also been commented or neutralized. Random aphelions and perihelions are between 1 million and 100 million km now. If aphelion is greater than perihelion they are swapped first. The intermediate variable "temp" is then reset to 0 before the "Kepler segment" begins. If it wasn't reset, the simulator would simply use it! Simulator comes out with this in the $BES-$EES section:

aphelion=perihelion+aphelion; aphelion>>=10; guess=12; temp=aphelion/guess; aphelion=aphelion-temp; perihelion=sqrt(aphelion); perihelion=perihelion*aphelion; guess=perihelion/guess; Less lines, but now two constants (10 and 12) (approximately) scale days with mega-meters here, where the Sun is this massive. Initially, it was only the 12 constant, which shifted, divided and was also a XOR argument to fit some more.

Both codes here, inside the $BES-$EES segments are exactly equivalent regarding the outputs and are a distant ancestor-descendant pair. Many million generations apart, but not very much different.

[This comment is no longer endorsed by its author]

We can modify the above code to:

$DECLAREINT aphelion perihelion dif guess temp $RINVAR aphelion(1000,100000) perihelion(1000,100000) $RETVAR guess

if (aphelion<perihelion) { temp=perihelion; perihelion=aphelion; aphelion=temp; }

temp=0;

$BES
aphelion=perihelion+aphelion;
aphelion=aphelion+aphelion;
aphelion=aphelion+aphelion;
guess=12;
aphelion=aphelion>>guess;
temp=aphelion/guess;
aphelion=aphelion-temp;
dif=sqrt(aphelion);
//aphelion=guess|aphelion;
aphelion=aphelion*dif;
//aphelion=guess^aphelion;
guess=aphelion/guess;
$EES

As you can see, there is no $RESCOM metacommand and the two "overfit" OR and XOR lines has also been commented or neutralized. Random aphelions and perihelions are between 1 million and 100 million km now. If aphelion is greater than perihelion they are swapped first. The intermediate variable "temp" is then reset to 0 before the "Kepler segment" begins. If it wasn't, the simulator would simply use it!

Simulator comes out with this in the $BES-$EES section:

```
aphelion=perihelion+aphelion;
aphelion>>=10;
guess=12;
temp=aphelion/guess;
aphelion=aphelion-temp;
perihelion=sqrt(aphelion);
perihelion=perihelion*aphelion;
guess=perihelion/guess;
```

Less lines, but now two constants (10 and 12) (approximately) scale days and mega-meters here, where the Sun is so massive. Before, it was only the constant of 12, which shifted, divided and was also a XOR argument to fit some more.

Both codes here, inside the $BES-$EES segment are exactly equivalent regarding the output and are a distant ancestor-descendant pair. Many million generations apart.

[This comment is no longer endorsed by its author]

I know and I agree with this.

Then you have two different constants (10 and 12). One for the shifting and another for the division. It's nothing wrong with that, but the simulator was prevented to have more constants then absolutely necessary. So everything was done with the "12" and I was discussing that.

The XOR with 12 won't do much after dividing by 12. For small radii, OR with 12 (in units of about 10^6km) will have an effect. These two constants are probably just overfitting. Indeed, it nails Mercury, the smallest and thus most vulnerable to these effects.* Rounding** the square root is also probably overfitting or just noise. It will have a larger effect, but smooth across planets, so it is probably canceled out by the choice of other numbers. Ignoring those three effects, it is a constant times the 3/2 power of average of the axes. The deviation from Kepler's law is that it should ignore perihelion.*** But for un-eccentric orbits, there's no difference. Since the training data isn't eccentric, this failure is unsurprising. That is, the code is unsurprising; that the code is so accurate is surprising. That it correctly calculates the orbital period of Halley's comet, rather than underestimating by a factor of 2^(3/2) is implausible.***

* The control group is too homogeneous. If it contained something close in, overfitting for Mercury would have been penalized in the final evaluation.

** Are you sure it's rounding? [Edit: Yes: bitwise operations are strongly suggestive.]

*** These statements are wrong because I confused apehelion with the semi-major axis. So removing the bitwise operations yields exactly Kepler's law. If you switch from ints to doubles it becomes more accurate. But wmorgan has a constant error: it is divide by 4096, not 1024. This should make the rounding errors pretty bad for Mercury. Maybe the bitwise operations are to fix this, if they aren't noise. My C compiler does not reproduce the claimed error percentages, so I'm not going to pursue this.

This is a fairly typical example of *genetic programming*, albeit written in a slightly "outsider" way.

No criticism, but readers should be given the standard name of the technique.

EDIT in particular it's GP symbolic regression. On the other hand it's atypical in that there's no crossover.

Very interesting demonstration. Thanks for sharing this; it was fun to read through! I think I have a pretty good idea of how it works.

As a professional programmer:

That code it generated...is really, really shitty. It's unreadable, and for that reason, a human cannot look at the generated code and figure out "what's going on," i.e. Kepler's laws. Insofar as it works, it's much more reminiscent of 0x5f3759df, but that algorithm was optimizing for speed, not correctness or elegance.

I'm not surprised that the algorithm does worse on the control group, for the same reason that I'd question the assumption that it will do better on future generations. It could easily be over-fitting, partially because there is no selection pressure for an elegant solution. Empirically, elegant code does better in novel contexts.

"Evolution is dumb, but it works" -- I believe that showing that was the (implicit) goal, and it was amply demonstrated.

Also related: mirror.fem-net.de/CCC/28C3/mp4-h264-HQ/28c3-4764-en-automatic_algorithm_invention_with_a_gpu_h264.mp4.torrent - another outsider perspective explanation of genetic programming.

Also related - another outsider perspective explanation of genetic programming.

[This comment is no longer endorsed by its author]

Usually it then just notices a syntax error and repairs everything to the last working version

That sounds like a terrible waste. Why not use a tree-based model of C that excludes as many invalid programs as practical?

That sounds like a terrible waste.

Technically speaking this is a very swift part, since every syntax error is detected **before** the code is rewritten and the mutations are just abandoned.

Eureqa is a much better tool for this. It come up with the solution `Orbit = 0.01239*Aphelion`

in a few seconds. Some other solutions it found: http://i.imgur.com/tZwBedp.png?1

The genetic programming may work better for creating a strong AI than to solve that kind of problem. The intelligence did evolve, heh, heh.

If you are trying to evolve a program that solves something of this kind, you end up with an entirely wrong algorithm, constants in which are going to evolve to make the best outcome there can be. The solution just gets stuck in local maximum. Like human eye that has retina backwards. It is more productive to try to iterate over possible solutions starting from the simplest (to implement occam's razor), using some separate process (hill climbing from several random points?) to tune the constants to their best.

Given the overall lack of eccentricity among the planets, I suspect this would converge faster if you started over, switching the training and test groups.

Say, that we have the following observational data:

PlanetAphelion

000 kmPerihelion

Orbit time

daysThe minimal, the maximal distance between a planet and the Sun (both in thousands of kilometres) and the number of (Earth) days for one revolution around the Sun. Above is only the empirical data and no binding algorithm among the three quantities. The celestial mechanics rules which go by the name of the Kepler's laws. Can those rules be (re)invented by a computer program and how?

The following program code will be put into a simulator:

After several hours the following C code has been evolved inside of the $BES - $EES segment.

What the simulator does? It bombards the arena segment with a random C commands. Usually it then just notices a syntax error and repairs everything to the last working version. If everything is syntactically good, the simulator interprets the program and checks if the mutated version causes any run-time error like division by zero, a memory leak and so on. In the case of such an error it returns to the last good version. Otherwise it checks the variable called "bad", if it is at least as small as it was ever before. In the case it is, a new version has just been created and it is stored.

The evolutionary pressure is working toward ever better code, which increasingly well guesses the orbit time of nine planets. In this case the "orbit" variable has been under the $RESVAR clause and then the "gues" variable has been tested against the "orbit" variable. Had been no "$RESVAR orbit" statement, a simple "guess=orbit;" would evolve quickly. Had been no "$RESVAR bad" statement a simple "bad=-1000000;" could derail the process.

Many thousands of algorithms are born and die every second on a standard Windows PC inside this simulator. Million or billion generations later, the digital evolution is still running, even if an excellent solution has been already found.

And how good approximation for the Kepler (Newton) celestial mechanics of the Solar system we have here?

This good for the nine planets where the code evolved:

PlanetError %And this good for the control group of a comet and six asteroids:

Asteroid/CometError %Could be even much better after another billion generations and maybe with even more $INVAR examples. Generally, you can pick any three columns from any integer type table you want. And see this way, how they are related algorithmically. Can be more than three columns also.

The name of the simulator (evoluator) is Critticall and it is available at http://www.critticall.com