This is a special post for quick takes by Darklight. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

New to LessWrong?

12 comments, sorted by Click to highlight new comments since: Today at 7:05 PM

I recently interviewed with Epoch, and as part of a paid work trial they wanted me to write up a blog post about something interesting related to machine learning trends. This is what I came up with:

http://www.josephius.com/2022/09/05/energy-efficiency-trends-in-computation-and-long-term-implications/

Recently I tried out an experiment using the code from the Geometry of Truth paper to try to see if using simple label words like "true" and "false" could substitute for the datasets used to create truth probes. I also tried out a truth probe algorithm based on classifying with the higher cosine similarity to the mean vectors.

Initial results seemed to suggest that the label word vectors were sorta acceptable, albeit not nearly as good (around 70% accurate rather than 95%+ like with the datasets). However, testing on harder test sets showed much worse accuracy (sometimes below chance, somehow). So I can probably conclude that the label word vectors alone aren't sufficient for a good truth probe.

Interestingly, the cosine similarity approach worked almost identically well as the mass mean (aka difference in means) approach used in the paper. Unlike the mass mean approach though, the cosine similarity approach can be extended to a multi-class situation. Though, logistic regression can also be extended similarly, so it may not be particularly useful either, and I'm not sure there's even a use case for a multi-class probe. 

Anyways, I just thought I'd write up the results here in the unlikely event someone finds this kind of negative result as useful information.

Okay, so I decided to do an experiment in Python code where I modify the Iterated Prisoner's Dilemma to include Death, Asymmetric Power, and Aggressor Reputation, and run simulations to test how different strategies do. Basically, each player can now die if their points falls to zero or below, and the payoff matrix uses their points as a variable such that there is a power difference that affects what happens. Also, if a player defects first in any round of any match against a non-aggressor, they get the aggressor label, which matters for some strategies that target aggressors. 

Long story short, there's a particular strategy I call Avenger, which is Grim Trigger but also retaliates against aggressors (even if the aggression was against a different player) that ensures that the cooperative strategies (ones that never defect first against a non-aggressor) win if the game goes enough rounds. Without Avenger though, there's a chance that a single Opportunist strategy player wins instead. Opportunist will Defect when stronger and play Tit-For-Tat otherwise.

I feel like this has interesting real world implications.

Interestingly, Enforcer, which is Tit-For-Tat but also opens with Defect against aggressors, is not enough to ensure the cooperative strategies always win. For some reason you need Avenger in the mix.

Edit: In case anyone wants the code, it's here.

So, the aggressor tag is a way to keep memory across games, so they're not independent.  I wonder what happens when you start allowing more complicated reputation (including false accusations of aggression).

I feel like any interesting real-world implications are probably fairly tenuous.  I'd love to hear some and learn that I'm wrong.

So, I adjusted the aggressor system to work like alliances or defensive pacts instead of a universal memory tag. Basically, now players make allies when they both cooperate and aren't already enemies, and make enemies when defected against first, which sets all their allies to also consider the defector an enemy. This, doesn't change the result much. The alliance of nice strategies still wins the vast majority of the time.

I also tried out false flag scenarios where 50% of the time the victim of a defect first against non-enemy will actually be mistaken for the attacker. This has a small effect. There is a slight increase in the probability of an Opportunist strategy winning, but most of the time the alliance of nice strategies still wins, albeit with slightly fewer survivors on average.

My guess for why this happens is that nasty strategies rarely stay in alliances very long because they usually attack a fellow member at some point, and eventually, after sufficient rounds one of their false flag attempts will fail and they will inevitably be kicked from the alliance and be retaliated against.

The real world implications of this remain that it appears that your best bet of surviving in the long run as a person or civilization is to play a nice strategy, because if you play a nasty strategy, you are much less likely to survive in the long run.

In the limit, if the nasty strategies win, there will only be one survivor, dog eat dog highlander style, and your odds of being that winner are 1/N, where N is the number of players. On the other hand, if you play a nice strategy, you increase the strength of the nice alliance, and when the nice alliance wins as it usually does, you're much more likely to be a survivor and have flourished together.

My simulation currently by default has 150 players, 60 of which are nice. On average about 15 of these survive to round 200, which is a 25% survival rate. This seems bad, but the survival rate of nasty strategies is less than 1%. If I switch the model to use 50 Avengers and 50 Opportunists, on average 25 Avengers survive to zero Opportunists, a 50% survival rate for the Avengers.

Thus, increasing the proportion of starting nice players increases the odds of nice players surviving, so there is an incentive to play nice.

Admittedly this is a fairly simple set up without things like uncertainty and mistakes, so yes, it may not really apply to the real world. I just find it interesting that it implies that strong coordinated retribution can, at least in this toy set up, be useful for shaping the environment into one where cooperation thrives, even after accounting for power differentials and the ability to kill opponents outright, which otherwise change the game enough that straight Tit-For-Tat doesn't automatically dominate.

It's possible there are some situations where this may resemble the real world. Like, if you ignore mere accusations and focus on just actual clear cut cases where you know the aggression has occurred, such as with countries and wars, it seems to resemble how alliances form and retaliation occurs when anybody in the alliance is attacked?

I personally also see it as relevant for something like hypothetical powerful alien AGIs that can see everything that happens from space, and so there could be some kind of advanced game theoretic coordination at a distance with this. Though that admittedly is highly speculative.

It would be nice though if there was a reason to be cooperative even to weaker entities as that would imply that AGI could possibly have game theoretic reasons not to destroy us.

Update: I made an interactive webpage where you can run the simulation and experiment with a different payoff matrix and changes to various other parameters.

I have some ideas and drafts for posts that I've been sitting on because I feel somewhat intimidated by the level of intellectual rigor I would need to put into the final drafts to ensure I'm not downvoted into oblivion (something a younger me experienced in the early days of Less Wrong).

Should I try to overcome this fear, or is it justified?

For instance, I have a draft of a response to Eliezer's List of Lethalities post that I've been sitting on since 2022/04/11 because I doubted it would be well received given that it tries to be hopeful and, as a former machine learning scientist, I try to challenge a lot of LW orthodoxy about AGI in it. I have tremendous respect for Eliezer though, so I'm also uncertain if my ideas and arguments aren't just hairbrained foolishness that will be shot down rapidly once exposed to the real world, and the incisive criticism of Less Wrongers.

The posts here are also now of such high quality that I feel the bar is too high for me to meet with my writing, which tends to be more "interesting train-of-thought in unformatted paragraphs" than the "point-by-point articulate with section titles and footnotes" style that people tend to employ.

Anyone have any thoughts?

Personally, I find shortform to be an invaluable playground for ideas. When I get downvoted, it feels lower stakes. It's easier to ignore aloof and smugnorant comments, and easier to update on serious/helpful comments. And depending on how it goes, I sometimes just turn it into a regular post later, with a note at the top saying that it was adapted from a shortform.

If you really want to avoid smackdowns, you could also just privately share your drafts with friends first and ask for respectful corrections.

Spitballing other ideas, I guess you could phrase your claims as questions, like "have objections X, Y, or Z been discussed somewhere already? If so, can anyone link me to those discussions?" Seems like that could fail silently though, if an over-eager commenter gives you a link to low-quality discussion. But there are pros and cons for every course of action/inaction.

I was recently trying to figure out a way to calculate my P(Doom) using math. I initially tried just making a back of the envelope calculation by making a list of For and Against arguments and then dividing the number of For arguments by the total number of arguments. This led to a P(Doom) of 55%, which later got revised to 40% when I added more Against arguments. I also looked into using Bayes Theorem and actual probability calculations, but determining P(E | H) and P(E) to input into P(H | E) = P(E | H) * P(H) / P(E) is surprisingly hard and confusing.

So, I had a thought.  The glory system idea that I posted about earlier, if it leads to a successful, vibrant democratic community forum, could actually serve as a kind of dataset for value learning.  If each post has a number attached to it that indicates the aggregated approval of human beings, this can serve as a rough proxy for a kind of utility or Coherent Aggregated Volition.

Given that individual examples will probably be quite noisy, but averaged across a large amount of posts, it could function as a real world dataset, with the post content being the input, and the post's vote tally being the output label.  You could then train a supervised learning classifier or regressor that could then be used to guide a Friendly AI model, like a trained conscience.

This admittedly would not be provably Friendly, but as a vector of attack for the value learning problem, it is relatively straightforward to implement and probably more feasible in the short-run than anything else I've encountered.

Another thought is that maybe Less Wrong itself, if it were to expand in size and become large enough to roughly represent humanity, could be used as such a dataset.