My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering "consequentialist reasoning").




Quick reaction:

  • I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controlling the future in the end.
  • I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
  • And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing. Which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.

I think literal extinction from AI is a somewhat odd outcome to study, as it depends heavily on difficult-to-reason-about properties of the world (e.g. the probability that aliens would trade substantial sums of resources for emulated human minds, and the way acausal trade works in practice).

What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?

(My current take is that for the majority of readers, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)

I think literal extinction from AI is a somewhat odd outcome to study, as it depends heavily on difficult-to-reason-about properties of the world (e.g. the probability that aliens would trade substantial sums of resources for emulated human minds, and the way acausal trade works in practice).

That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point is not so much to figure out whether the particular arguments go through as to ask which properties your model must have if you want to evaluate those arguments rigorously.

A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.

Right, I agree. I didn't realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like "[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system."

I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.

[% of loss explained] isn't a good interpretability metric [edit: isn't enough to get guarantees].
In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.
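As a minimal sketch of how this metric is often operationalised (the function name and the exact normalisation here are my assumption, not something specified in the discussion above): take the loss of a trivial baseline, the loss of the full model, and the loss you get when you run the proposed explanation instead of the model, and ask what fraction of the baseline-to-model loss gap the explanation recovers.

```python
def loss_explained(loss_full, loss_explanation, loss_baseline):
    """Fraction of the loss gap (baseline -> full model) recovered
    by the explanation.

    1.0 means the explanation matches the model's loss exactly;
    0.0 means it does no better than the trivial baseline.
    """
    return (loss_baseline - loss_explanation) / (loss_baseline - loss_full)


# Hypothetical numbers: baseline loss 4.0, full model loss 2.0,
# explanation loss 2.5 -> the explanation recovers 75% of the gap.
print(loss_explained(2.0, 2.5, 4.0))  # 0.75
```

Note that this is an average over some evaluation distribution, which is exactly what the objection below exploits: an explanation can recover ~100% of the loss on that distribution while saying nothing about the rare inputs that matter.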

Suppose you have misaligned superintelligence X pretending to be a helpful assistant A --- that is, acting as A in all situations except those where it could take over the world. Then the explanation "X is behaving as A" will explain 100% of loss, but actually using X will still kill you.

For [% of loss explained] to be a useful metric [edit: robust for detecting misalignment], the explanation would need to account for most of the loss on the inputs that actually matter. And since we fundamentally can't tell which inputs those are, the metric will only be useful (for detecting misaligned superintelligences) if we can explain 100% of loss on all possible inputs.

I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. I.e., if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large.)

To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.

But that complaint aside: sure, all else being equal, all of the points you mention seem better to have than not to have.

Might be obvious, but perhaps worth noting anyway: ensuring that our boundaries are respected is, at least under a straightforward understanding of "boundaries", not sufficient for being safe.
For example:

  • If I take away all the food from your local supermarkets (etc.), you will die of starvation --- but I haven't done anything to your boundaries.
  • On a higher level, you can wipe out humanity without messing with our boundaries, by blocking out the sun.

An aspect that I would not take into account is the expected impact of your children.

Most importantly, it just seems wrong to make personal-happiness decisions subservient to impact.
But even if you did want to optimise impact through others, betting on your children seems riskier and less effective than, for example, engaging with interested students. (And even if you wanted to optimise impact at all costs, the key factors might not be your impact through others, but instead (i) your opportunity costs, (ii) second-order effects, where having kids makes you more or less happy, and this changes the impact of your work, and (iii) negative second-order effects that "sacrificing personal happiness for impact" has on the perception of the community.)

In fact it's hard to find probable worlds where having kids is a really bad idea, IMO.

One scenario where you might want to have kids in general, but not if timelines are short, is if you feel positive about having kids but view the first few years as a chore (i.e., they cost you time, sleep, and money). So if you view kids as an investment of the form "take a hit to your happiness now, get more happiness back later", then not having kids now seems justifiable. But I think this sort of reasoning requires pretty short timelines (which I have), with high confidence (which I don't have), and high confidence that the first few years of having kids are net-negative happiness for you (which I don't have).

(But overall I endorse the claim that, mostly, if you would have otherwise wanted kids, you should still have them.)
