Beth Barnes


AI safety via market making

Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that's supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument

Learning Normativity: A Research Agenda

I see myself as trying to construct a theory of normativity which gets that "by construction", IE, we can't expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.

Nice, this is what I was trying to say but was struggling to phrase it. I like this.

I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we're correct about the hope that natural language is sufficiently universal. It's quite likely I'm either confused or being sloppy though.

You could put 'learning the prior' inside HCH I think, it would just be inefficient - for every claim, you'd ask your HCH tree how much you should believe it, and HCH would think about the correct way to do bayesian reasoning, what the prior on that claim should be, and how well it predicted every piece of data you'd seen so far, in conjunction with everything else in your prior. I think one view of learning the prior is just making this process more tractable/practical, and saving you from having to revisit all your data points every time you ask any question - you just do all the learning from data once, then use the result of that to answer any subsequent questions.

Learning Normativity: A Research Agenda

However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there... but as I've mentioned earlier, human imitation does not get us all the way. Humans don't perfectly endorse their own reactions.

Note that Learning the Prior uses an amplified human (ie, a human with access to a model trained via IDA/Debate/RRM). So we can do a bit better than a base human - e.g. could do something like having an HCH tree where many humans generate possible feedback and other humans look at the feedback and decide how much they endorse it.
I think the target is not to get normativity 'correct', but to design a mechanism such that we can't expect to find any mechanism that does better.

Extortion beats brinksmanship, but the audience matters

FYI/nit: at first glance I thought extorsion was supposed to mean something different from extortion (I've never seen it spelt with the s) and this was a little confusing. 

AI safety via market making

Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it. 
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose - the honest debater will pay to recurse until they get to a winning node. 
If the the dishonest debater makes some claim and plan to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge will just see these two alternative answers and won't trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose.
Does that make sense?


AI safety via market making

Suppose by strong induction that  always gives the right answer immediately for all sets of size less than 

Pretty sure debate can also access R if you make this strong of an assumption - ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n. 

I think the sort of claim that's actually useful is going to look more like 'we can guarantee that we'll get a reasonable training signal for problems in [some class]'

Ie, suppose M gives correct answers some fraction of the time. Are these answers going to get lower loss? As n gets large, the chance that M has made a mistake somewhere in the recursion chain gets large, and the correct answer is not necessarily rewarded. 

AI safety via market making

I think for debate you can fix the circular argument problem by requiring debaters to 'pay' (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don't resolve before the depth limit, then they'll lose.  

AGI safety from first principles: Goals and Agency

But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.

I think the ways in which humans are not fully consequentialist is much broader - we often do things because of habit, instinct, because doing that thing feels rewarding itself, because we're imitating someone else, etc. 

Writeup: Progress on AI Safety via Debate

That's correct about simultaneity.

Yeah, the questions and answers can be arbitrary, doesn't have to be X and ¬X.

I'm not completely sure whether Scott's method would work given how we're defining the meaning of questions, especially in the middle of the debate. The idea is to define the question by how a snapshot of the questioner taken when they wrote the question would answer questions about what they meant. So in this case, if you asked the questioner 'is your question equivalent to 'should I eat potatoes tonight?'', they wouldn't know. On the other hand, you could ask them ' if I think you should eat potatoes tonight, is your question equivalent to 'should I eat potatoes tonight?''. This would work as long as you were referring only to what one debater believed you should eat tonight, I think.

I feel fairly ok about this as a way to define the meaning of questions written by debaters within the debate. I'm less sure about how to define the top-level question. It seems like there's only really one question, which is 'what should I do?', and it's going to have to be defined by how the human asker clarifies their meaning. I'm not sure whether the meaning of the question should be allowed to include things the questioner doesn't know at the time of asking.

Load More