An eccentric dreamer in search of truth and happiness for all. Formerly posted on Felicifia back in the day under the same name. Been a member of Less Wrong and involved in Effective Altruism since roughly 2013.



I'm wondering what people's opinions are on how urgent alignment work is. I'm a former ML scientist who previously worked at Maluuba and Huawei Canada, but switched industries into game development, at least in part to avoid contributing to AI capabilities research. I tried earlier to interview with FAR and Generally Intelligent, but didn't get in. I've also done some cursory independent AI safety research on interpretability and game-theoretic ideas in my spare time, though nothing interesting enough to publish yet.

My wife also recently had a baby, and caring for him is a substantial time sink, especially for the next year until daycare starts. Is it worth considering things like hiring a nanny, if it'll free me up to actually do more AI safety research? I'm uncertain if I can realistically contribute to the field, but I also feel like AGI could potentially be coming very soon, and maybe I should make the effort just in case it makes some meaningful difference.

Thanks for the reply!

So, the main issue I'm finding with putting them all into one proposal is that there's a 1000 character limit on the main summary section where you describe the project, and I cannot figure out how to cram multiple ideas into that 1000 characters without seriously compromising the quality of my explanations for each.

I'm not sure if exceeding that character limit will get my proposal thrown out without being looked at though, so I hesitate to try that. Any thoughts?

I already tried discussing a very similar concept I call Superrational Signalling in this post. It got almost no attention, and I have doubts that Less Wrong is receptive to such ideas.

I also tried actually programming a game-theoretic simulation to test the idea, which you can find here, along with code and explanation. Haven't gotten around to making a full post about it though (just a shortform).

So, I have three very distinct ideas for projects that I'm thinking about applying to the Long Term Future Fund for. Does anyone happen to know if it's better to try to fit them all into one application, or split them into three separate applications?

Recently I tried out an experiment using the code from the Geometry of Truth paper to see if simple label words like "true" and "false" could substitute for the datasets used to create truth probes. I also tried out a truth probe algorithm that classifies each input by whichever class mean vector it has the higher cosine similarity to.

Initial results seemed to suggest that the label word vectors were sorta acceptable, albeit not nearly as good (around 70% accurate rather than 95%+ like with the datasets). However, testing on harder test sets showed much worse accuracy (sometimes below chance, somehow). So I can probably conclude that the label word vectors alone aren't sufficient for a good truth probe.

Interestingly, the cosine similarity approach performed almost identically to the mass mean (aka difference in means) approach used in the paper. Unlike the mass mean approach though, the cosine similarity approach can be extended to a multi-class situation. Though, logistic regression can also be extended similarly, so it may not be particularly useful either, and I'm not sure there's even a use case for a multi-class probe.
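For anyone curious what the two classifiers look like side by side, here is a minimal sketch on toy 2-D vectors. It is not the code from the paper or from my experiment: the function names and the toy data are made up for illustration, and thresholding the mass mean projection at the midpoint between the two class means is a simplifying assumption (the paper's probe may differ in how it turns the projection into a prediction).

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def predict_cosine(x, class_means):
    """Pick the label whose mean vector has the highest cosine similarity to x.

    class_means maps label -> mean vector, so any number of classes works.
    """
    return max(class_means, key=lambda label: cosine(x, class_means[label]))

def predict_mass_mean(x, mean_false, mean_true):
    """Binary difference-in-means probe (simplified, see the note above).

    Projects x onto (mean_true - mean_false) and thresholds at the midpoint.
    """
    direction = [t - f for t, f in zip(mean_true, mean_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mean_true, mean_false)]
    centered = [xi - mi for xi, mi in zip(x, midpoint)]
    return dot(centered, direction) > 0

# Hypothetical toy "activations", standing in for real hidden states.
true_acts = [[2.0, 0.1], [1.8, -0.1], [2.2, 0.0]]
false_acts = [[0.1, 2.0], [-0.1, 1.9], [0.0, 2.1]]
means = {"true": mean_vector(true_acts), "false": mean_vector(false_acts)}

query = [1.5, 0.3]
cosine_label = predict_cosine(query, means)
mass_mean_label = predict_mass_mean(query, means["false"], means["true"])
```

Adding a third entry to `means` is all it takes to make the cosine version multi-class, which is the extension mentioned above.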

Anyways, I just thought I'd write up the results here in the unlikely event someone finds this kind of negative result useful.

Update: I made an interactive webpage where you can run the simulation and experiment with a different payoff matrix and changes to various other parameters.

So, I adjusted the aggressor system to work like alliances or defensive pacts instead of a universal memory tag. Basically, now players make allies when they both cooperate and aren't already enemies, and make enemies when defected against first, which sets all their allies to also consider the defector an enemy. This doesn't change the result much. The alliance of nice strategies still wins the vast majority of the time.
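Roughly, the alliance bookkeeping described above could be sketched like this. This is an illustrative reconstruction rather than the actual simulation code: the `Player` class, `mark_enemy`, and `update_relations` are hypothetical names, and how to handle two players defecting on each other in the same round is my assumption here (each is treated as having attacked the other).

```python
class Player:
    """Minimal stand-in for a simulation player (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.allies = set()
        self.enemies = set()

def mark_enemy(victim, attacker):
    """The victim and all of the victim's allies now treat the attacker as an enemy."""
    for member in {victim} | victim.allies:
        if member is attacker:
            continue
        member.allies.discard(attacker)
        attacker.allies.discard(member)
        member.enemies.add(attacker)

def update_relations(a, b, a_move, b_move):
    """Update alliances after one round; moves are 'C' (cooperate) or 'D' (defect)."""
    # A defection only counts as aggression if the target was not already an enemy.
    a_attacked = a_move == "D" and b not in a.enemies
    b_attacked = b_move == "D" and a not in b.enemies
    if a_move == "C" and b_move == "C":
        if b not in a.enemies and a not in b.enemies:
            a.allies.add(b)
            b.allies.add(a)
    if a_attacked:
        mark_enemy(b, a)
    if b_attacked:
        mark_enemy(a, b)
```

For example, if A is allied with both B and C and then C defects on B, both B and A end up with C in their `enemies` set, and later defections by A against C are treated as retaliation rather than aggression.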

I also tried out false flag scenarios where 50% of the time the victim of a first defection against a non-enemy will actually be mistaken for the attacker. This has a small effect. There is a slight increase in the probability of an Opportunist strategy winning, but most of the time the alliance of nice strategies still wins, albeit with slightly fewer survivors on average.
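The false flag rule is simple enough to show in a few lines. This is a sketch with a hypothetical function name and signature, not the simulation's actual code; the only thing taken from the description above is the 50% default.

```python
import random

def perceived_roles(victim, attacker, p_false_flag=0.5, rng=random):
    """Return (perceived_victim, perceived_attacker) after a first defection.

    With probability p_false_flag the roles are swapped, so the real victim
    gets blamed for the attack.
    """
    if rng.random() < p_false_flag:
        return attacker, victim
    return victim, attacker
```

Whatever retaliation logic follows would then act on the perceived roles instead of the real ones.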

My guess for why this happens is that nasty strategies rarely stay in alliances very long because they usually attack a fellow member at some point. Eventually, after enough rounds, one of their false flag attempts will fail, and they will be kicked from the alliance and retaliated against.

The real world implication remains that your best bet for surviving in the long run, as a person or a civilization, appears to be playing a nice strategy, because nasty strategies are much less likely to survive in the long run.

In the limit, if the nasty strategies win, there will only be one survivor, dog-eat-dog, Highlander style, and your odds of being that winner are 1/N, where N is the number of players. On the other hand, if you play a nice strategy, you increase the strength of the nice alliance, and when the nice alliance wins as it usually does, you're much more likely to be a survivor and to have flourished together.

My simulation currently by default has 150 players, 60 of which are nice. On average about 15 of these survive to round 200, which is a 25% survival rate. This seems bad, but the survival rate of nasty strategies is less than 1%. If I switch the model to use 50 Avengers and 50 Opportunists, on average 25 Avengers survive versus zero Opportunists, a 50% survival rate for the Avengers.

Thus, increasing the proportion of starting nice players increases the odds of nice players surviving, so there is an incentive to play nice.
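For reference, the arithmetic behind those survival figures is below. The numbers are the ones quoted above; the function and variable names are just for illustration, and the last line only evaluates the 1/N odds for N = 150 without claiming it matches the measured rate for nasty strategies.

```python
def survival_rate(survivors, starters):
    """Fraction of starting players still alive at the end of a run."""
    return survivors / starters

default_nice_rate = survival_rate(15, 60)  # default setup: 60 nice players, about 15 survive
avenger_rate = survival_rate(25, 50)       # 50 Avengers vs 50 Opportunists, about 25 Avengers survive
lone_winner_odds = 1 / 150                 # 1/N odds of being the sole survivor with N = 150
```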

Admittedly this is a fairly simple setup without things like uncertainty and mistakes, so yes, it may not really apply to the real world. I just find it interesting that it implies that strong coordinated retribution can, at least in this toy setup, be useful for shaping the environment into one where cooperation thrives, even after accounting for power differentials and the ability to kill opponents outright, which otherwise change the game enough that straight Tit-For-Tat doesn't automatically dominate.

It's possible there are some situations where this may resemble the real world. Like, if you ignore mere accusations and focus on just actual clear cut cases where you know the aggression has occurred, such as with countries and wars, it seems to resemble how alliances form and retaliation occurs when anybody in the alliance is attacked?

I personally also see it as relevant for something like hypothetical powerful alien AGIs that can see everything that happens from space, and so there could be some kind of advanced game theoretic coordination at a distance with this. Though that admittedly is highly speculative.

It would be nice though if there was a reason to be cooperative even to weaker entities as that would imply that AGI could possibly have game theoretic reasons not to destroy us.

Okay, so I decided to do an experiment in Python code where I modify the Iterated Prisoner's Dilemma to include Death, Asymmetric Power, and Aggressor Reputation, and run simulations to test how different strategies do. Basically, each player can now die if their points fall to zero or below, and the payoff matrix uses their points as a variable such that there is a power difference that affects what happens. Also, if a player defects first in any round of any match against a non-aggressor, they get the aggressor label, which matters for some strategies that target aggressors.

Long story short, there's a particular strategy I call Avenger, which is Grim Trigger but also retaliates against aggressors (even if the aggression was against a different player). With Avenger in the mix, the cooperative strategies (ones that never defect first against a non-aggressor) win if the game goes enough rounds. Without Avenger though, there's a chance that a single Opportunist strategy player wins instead. Opportunist will Defect when stronger and play Tit-For-Tat otherwise.

I feel like this has interesting real world implications.

Interestingly, Enforcer, which is Tit-For-Tat but also opens with Defect against aggressors, is not enough to ensure the cooperative strategies always win. For some reason you need Avenger in the mix.
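To make the strategy descriptions concrete, here is a rough sketch of how they could be written as decision functions. This is a paraphrase of the descriptions above, not the linked code: the `Player` fields, the function signatures, and the exact reading of "defects first" (defecting before the opponent has defected in the current match) are assumptions, and the power-dependent payoff matrix and death rule are left out entirely.

```python
from dataclasses import dataclass

C, D = "C", "D"

@dataclass
class Player:
    name: str
    points: float
    aggressor: bool = False

# Each strategy sees itself, its opponent, and the opponent's earlier
# moves in the current match, and returns C or D.

def tit_for_tat(me, opp, opp_moves):
    return opp_moves[-1] if opp_moves else C

def enforcer(me, opp, opp_moves):
    """Tit-For-Tat, but opens with Defect against a known aggressor."""
    if not opp_moves:
        return D if opp.aggressor else C
    return opp_moves[-1]

def avenger(me, opp, opp_moves):
    """Grim Trigger that also defects against anyone carrying the aggressor label."""
    return D if opp.aggressor or D in opp_moves else C

def opportunist(me, opp, opp_moves):
    """Defect when stronger, otherwise play Tit-For-Tat."""
    return D if me.points > opp.points else tit_for_tat(me, opp, opp_moves)

def label_aggressors(a, b, a_move, b_move, a_moves, b_moves):
    """Tag whoever defects first against a non-aggressor (assumed reading)."""
    a_first = a_move == D and D not in b_moves and not b.aggressor
    b_first = b_move == D and D not in a_moves and not a.aggressor
    a.aggressor = a.aggressor or a_first
    b.aggressor = b.aggressor or b_first
```

One difference this makes visible: under this reading, Enforcer goes back to mirroring after its opening Defect, so an aggressor who cooperates gets cooperation back, whereas Avenger keeps defecting for as long as the label sticks.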

Edit: In case anyone wants the code, it's here.

I was recently trying to figure out a way to calculate my P(Doom) using math. I initially tried just making a back-of-the-envelope calculation by making a list of For and Against arguments and then dividing the number of For arguments by the total number of arguments. This led to a P(Doom) of 55%, which later got revised to 40% when I added more Against arguments. I also looked into using Bayes' theorem and actual probability calculations, but determining P(E | H) and P(E) to input into P(H | E) = P(E | H) * P(H) / P(E) is surprisingly hard and confusing.
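Here is a small sketch of both calculations. The argument counts and the probabilities plugged in are hypothetical, picked only so the numbers come out cleanly (the counts happen to reproduce 55% and 40%, but they are not my actual lists). One thing that helps with P(E) is the law of total probability: P(E) = P(E | H) * P(H) + P(E | not H) * P(not H), so you only need to estimate how likely the evidence is under each hypothesis.

```python
def count_based_estimate(n_for, n_against):
    """Back-of-the-envelope estimate: share of arguments that point toward doom."""
    return n_for / (n_for + n_against)

def bayes_posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) via Bayes' theorem, expanding P(E) with the law of total probability."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# Hypothetical argument counts, for illustration only.
first_pass = count_based_estimate(22, 18)   # 22 / 40 = 0.55
second_pass = count_based_estimate(22, 33)  # 22 / 55 = 0.40 after adding Against arguments

# Hypothetical inputs: 50% prior, evidence twice as likely if H is true.
posterior = bayes_posterior(prior=0.5, p_e_given_h=0.8, p_e_given_not_h=0.4)  # 0.4 / 0.6, about 0.667
```

The counting method implicitly gives every argument equal weight, while the Bayesian version makes you say how much more likely each piece of evidence is under doom than under no doom.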
