Davidmanheim


Eight claims about multi-agent AGI safety

My point was that deception will almost certainly outperform honesty/cooperation when AI is interacting with humans, and on reflection, it seems likely to do so by default even when interacting with other AIs, because there is no group selection pressure.

A vastly faster vaccine rollout

One key limitation for vaccines is supply, as others have noted. That certainly doesn't explain everything, but it does explain a lot. 

This obstacle was, of course, completely foreseeable, and by the end of April we had proposed a simple way to deal with the problem, presented it to policymakers, and even posted it on LessWrong.

Thus begins our story.

Unfortunately, we couldn't get UK policymakers on board when we discussed it, and the US was focused on Operation Warp Speed - Congress wasn't going to allocate money for a new idea.

We were told that, in general, policymakers wanted an idea published and peer reviewed before they'd take it more seriously, so we submitted a paper. At this point, as a bonus, Preprints.org refused to put the preprint online. (No, really. And they wouldn't explain why.)

We submitted the paper to Vaccine on May 20th, and they sent it for review. We got it back in mid-June, did revisions, and resubmitted in early July; then the journal changed its mind and said "your paper does not appear to conduct original research, thus it does not fit the criteria." When we emailed to ask what they were doing, they relented and said we could cut the length in half and resubmit it as an opinion piece.

We went elsewhere, to a newer open-access journal with non-blinded review, and the paper was finally online, fully published, in October: https://f1000research.com/articles/9-1154

The Case for a Journal of AI Alignment

In the spirit of open peer review, here are a few thoughts:

First, overall: I was convinced during earlier discussions that this is a bad idea - not because of costs, but because it lacks real benefits and would not itself serve the necessary functions. (Also see this earlier proposal, which got no comments.) There are already outlets that allow robust peer review, and the field is not well served by moving away from the current CS/ML dynamic of arXiv papers and conference presentations, which allow for more rapid iteration and for building on others' work than traditional journals do - journals are often a year or more out of date by the time they appear. However, if this were done, I would strongly suggest doing it as an arXiv overlay journal rather than with a traditional structure.

One key drawback you didn't note: insulating AI safety from mainstream AI work could further isolate it. It would also likely make it harder for AI-safety researchers to have mainstream academic careers, since narrow journals don't help on most academic prestige metrics.

Two more minor disagreements. First, on the claim that "If JAA existed, it would be a great place to send someone who wanted a general overview of the field": I would disagree - in-field journals are rarely as good a source as textbooks or non-technical overviews. Second, the idea that a journal would provide deeper, more specific, and better review than Alignment Forum discussions and current informal feedback seems far-fetched, given my experience publishing in journals specific to a narrow area, like Health Security, compared with my experience getting feedback on AI safety ideas.

Eight claims about multi-agent AGI safety

Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive.

I think this is mistaken. (Or perhaps I don't understand a key claim or assumption.)

Honesty evolved as a group dynamic, where it was beneficial for the group to have ways for individuals to honestly commit, or make lying expensive in some way. That cooperative pressure dynamic does not exist when a single agent is "evolving" on its own in an effectively static environment of humans. It does exist in a co-evolutionary multi-agent dynamic - so there is at least some reason for optimism within a multi-agent group, rather than between computational agents and humans - but the conditions for cooperation versus competition seem at least somewhat fragile.
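
To make that concrete, here is a minimal replicator-dynamics sketch (my own illustration; the payoff numbers are assumptions, not anything from the post) showing deception going to fixation in one-shot interactions unless the group can impose a cost on detected liars:

```python
# Replicator dynamics for honest vs. deceptive strategies.
# Payoff values are illustrative assumptions, not from the post.

def replicator(payoff, x0=0.5, steps=2000, dt=0.01):
    """Return the long-run fraction x of honest agents.

    payoff[i][j] is the payoff to strategy i against strategy j,
    with index 0 = honest, 1 = deceptive.
    """
    x = x0
    for _ in range(steps):
        f_h = payoff[0][0] * x + payoff[0][1] * (1 - x)  # honest fitness
        f_d = payoff[1][0] * x + payoff[1][1] * (1 - x)  # deceptive fitness
        # Standard replicator update: dx/dt = x * (f_h - mean fitness),
        # which simplifies to x * (1 - x) * (f_h - f_d).
        x += dt * x * (1 - x) * (f_h - f_d)
    return x

# One-shot interactions with no group-level enforcement: deceivers
# exploit honest partners, and deception goes to fixation.
no_enforcement = [[3, 0],   # honest vs honest, honest vs deceptive
                  [5, 1]]   # deceptive vs honest, deceptive vs deceptive
print(replicator(no_enforcement))    # ~0.0: honesty dies out

# A group-imposed cost c on deception (punishment, exclusion, making
# lying expensive) flips the dynamic, and honesty can persist.
c = 3
with_enforcement = [[3, 0],
                    [5 - c, 1 - c]]
print(replicator(with_enforcement))  # ~1.0: honesty is stable
```

The point of the toy model is only that the equilibrium depends on the enforcement term, which is supplied by the group - exactly the term that is absent when a single agent trains against an effectively static population of humans.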

Eight claims about multi-agent AGI safety

Strongly agree that it's unclear whether these failures would be detected.
For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm 

Eight claims about multi-agent AGI safety

Another possible argument is that we can't tell when multiple AIs are failing or subverting each other.
Agents pursuing their own goals in a multi-agent environment are intrinsically manipulative, and when agents manipulate one another, it happens in ways that we do not know how to detect or reason about. This is somewhat different from when they manipulate humans, where we have a clear idea of what does and does not qualify as harmful manipulation.

Vanessa Kosoy's Shortform

re: #5, that doesn't seem to claim that we can infer U given the agent's actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn't some $U' \neq U$ such that the observed actions are also optimal for $U'$.
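
For concreteness, a minimal rendering of the standard argument (my phrasing, not Vanessa's or Stuart's):

```latex
% Given any observed policy \pi, define a degenerate utility that
% rewards exactly the actions \pi takes:
%   U'(h, a) = 1 if a = \pi(h), and 0 otherwise.
% Then \pi is optimal for U', so observing behavior alone cannot rule out
\[
  \exists\, U' \neq U \quad \text{such that} \quad
  \pi \in \operatorname*{arg\,max}_{\tilde{\pi}}\;
  \mathbb{E}_{\tilde{\pi}}\!\Big[\sum_{t} U'(h_t, a_t)\Big].
\]
```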

(And as pointed out elsewhere, it isn't Stuart's thesis; it's a well-known and basic result in the decision theory / economics / philosophy literature.)

Reason as memetic immune disorder

Those aren't actually how Orthodox Jews interpret the rules or apply them nowadays. Tassels go only on very specific articles of clothing, which are hidden under people's shirts; I'm not even sure what "tying money to yourself" refers to; adulterers are stoned only if the Temple stands, and only under nearly-impossible-to-satisfy conditions; trees less than 5 years old are only considered a biblical problem in Israel, and if you're unsure, the fruit is permitted in the rest of the world; and the ritual purity laws don't apply in general, because everyone is assumed to be contaminated anyway.

What evidence will tell us about the new strain? How are you updating?

'Has been spotted in' isn't much to work with.


Agreed, but note that the US just found its first case, and it is community-acquired; plus, we aren't doing anything to stop importation. So I'm assuming it's everywhere already and just starting the exponential phase.

(Note that I cannot find good public data for spread within the UK, which would be the key way to update about the strain.)
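
For concreteness, this is the sort of arithmetic that data would feed (standard selection-coefficient estimation; the numbers below are hypothetical, not UK data):

```latex
% If the new strain's share of sequenced cases is p_t, its log-odds grow
% roughly linearly, with slope equal to its growth-rate advantage \Delta:
\[
  \log\frac{p_t}{1 - p_t} \;\approx\; \log\frac{p_0}{1 - p_0} + \Delta\, t.
\]
% Hypothetical example: a share rising from 10% to 50% over 28 days gives
% \Delta = \ln(9)/28 \approx 0.078 per day, which with a ~5.5-day
% generation time would imply a transmissibility advantage of roughly
% e^{0.078 \times 5.5} - 1 \approx 54%.
```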
