John_Maxwell — LessWrong

See something I've written which you disagree with? I'm experimenting with offering cash prizes of up to US$1000 to anyone who changes my mind about something I consider important. Message me our disagreement and I'll tell you how much I'll pay if you change my mind + details :-) (EDIT: I'm not logging into Less Wrong very often now, it might take me a while to see your message -- I'm still interested though)

Power makes you dumb, stay humble.
Tell everyone in the organization that safety is their responsibility, everyone's views are important.
Try to be accessible and not intimidating, admit that you make mistakes.
Schedule regular chats with underlings so they don't have to take initiative to flag potential problems. (If you think such chats aren't a good use of your time, another idea is to contract someone outside of the organization to do periodic informal safety chats. Chapter 9 is about how organizational outsiders are uniquely well-positioned to spot safety problems. Among other things, it seems workers are sometimes more willing to share concerns frankly with an outsider than they are with their boss.)
Accept that not all of the critical feedback you get will be good quality.

The book disrecommends anonymous surveys on the grounds that they communicate the subtext that sharing your views openly is unsafe. I think anonymous surveys might be a good idea in the EA community though -- retaliation against critics seems fairly common here (i.e. the culture of fear didn't come about by chance). Anyone who's been around here long enough will have figured out that sharing your views openly isn't safe. (See also the "People are pretty justified in their fears of critiquing EA leadership/community norms" bullet point here, and the last paragraph in this comment.)

Fair point. I also haven't done much posting since adding the bounty to my profile. Was thinking it might attract the attention of people reading the archives, but maybe there just aren't many archive readers.

There is some observational evidence that coffee drinking increases lifespan. I think the proposed mechanism has to do with promoting autophagy. https://www.acpjournals.org/doi/10.7326/M21-2977 But it looks like decaf works too. (Decaf has a bit of caffeine.)

I think somewhere else I read that unfiltered coffee doesn't improve lifespan, so try to drink the filtered stuff?

In my experience caffeine dependence is not a big deal and might help my sleep cycle.

Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.

Fair enough.

My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy.

Yeah personally building feels more natural to me.

I agree a leaderboard would be great. I think it'd be cool to have a leaderboard for proposals as well -- "this proposal has been unbroken for X days" seems like really valuable information that's not currently being collected.

I don't think I personally have enough clout to muster the coordination necessary for a tournament or leaderboard, but you probably do. One challenge is that different proposals are likely to assume different sorts of available capabilities. I have a hunch that many disagreements which appear to be about alignment are actually about capabilities.

In the absence of coordination, I think if someone like you were to simply start advertising themselves as an "uberbreaker" who can shoot holes in any proposal, and over time give reports on which proposals seem the strongest, that could be really valuable and status-rewarding. Sort of a "pre-Eliezer" person who I can run my ideas by in a lower stakes context, as opposed to saying "Hey Eliezer, I solved alignment -- wallop me if I'm wrong!"

I wrote a comment on your post with feedback.

I don't have anything prepared for red teaming at the moment -- I appreciate the offer though! Can I take advantage of it in the future? (Anyone who wants to give me critical feedback on my drafts should send me a personal message!)

I skimmed the post, here is some feedback (context):

I'm probably not the best person to red team this since some of my own alignment ideas are along similar lines. I'm also a bit on the optimistic side about alignment more generally -- it might be better to talk to a pessimist.
This sounds a bit like the idea of a "low-bandwidth oracle".
I think the biggest difficulty is the one you explicitly acknowledged -- boxing is hard.
But there are also problems around ensuring that bandwidth is actually limited. If you have a human check to see that the AGI's output conforms to the spec, the AGI could put an infohazard in the output, or write code which looks like it does one thing but actually does another. If you have a computer program or AI check the AI's output, the AGI could exploit a flaw in that program/AI. I think this sort of thing basically gets you a probabilistic safety measure, because there's always a risk that there's a flaw that the superintelligent AGI sees (or can infer) that you don't see. (I like this intuition pump for seeing why these sorts of problems are plausible.) I think probabilistic safety measures can be good if we stack a lot of them together in the right way.
The idea of emitting machine-checkable proofs is interesting. I'm not sure such proofs are very useful though. "Finding the right spec is one of the biggest challenges in formal methods." - source. And finding the right spec seems more difficult to outsource to an unfriendly AI. In general, I think using AI to improve software reliability seems good, and tractable.

I think you'll find it easier to get feedback if you keep your writing brief. Assume the reader's time is valuable. Sentences like "I will mention some stuff later that maybe will make it more clear how I’d think about such a question." should simply be deleted -- make huge cuts. I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post. Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.

Similarly, I think it is worth investing in clarity. If a sentence is unclear, I have a tendency to just keep reading and not mention it unless I have a prior that the author knows what they're talking about. (The older I get, the more I assume that unclear writing means the author is confused and ignorable.) I like writing advice from Paul Graham and Scott Adams.

Personally I'm more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading. I don't have much time to do feedback right now unfortunately.

Thanks for the reply!

As some background on my thinking here, last I checked there are a lot of people on the periphery of the alignment community who have some proposal or another they're working on, and they've generally found it really difficult to get quality critical feedback. (This is based on an email I remember reading from a community organizer a year or two ago saying "there is a desperate need for critical feedback".)

I'd put myself in this category as well -- I used to write a lot of posts and especially comments here on LW summarizing how I'd go about solving some aspect or another of the alignment problem, hoping that Cunningham's Law would trigger someone to point out a flaw in my approach. (In some cases I'd already have a flaw in mind along with a way to address it, but I figured it'd be more motivating to wait until someone mentioned a particular flaw in the simple version of the proposal before I mentioned the fix for it.)

Anyway, it seemed like people often didn't take the bait. (Thanks to everyone who did!) Even with offering $1000 to change my view, as I'm doing in my LW user profile now, I've had 0 takers. I stopped posting on LW/AF nearly as much partially because it has seemed more efficient to try to shoot holes in my ideas myself. On priors, I wouldn't have expected this to be true -- I'd expect that someone else is going to be better at finding flaws in my ideas than I am myself, because they'll have a different way of looking at things which could address my blind spots.

Lately I've developed a theory for what's going on. You might be familiar with the idea that humans are often subconsciously motivated by the need to acquire & defend social status. My theory is that there's an asymmetry in the motivations for alignment building & breaking work. The builder has an obvious status motive: If you become the person who "solved AI alignment", that'll be really good for your social status. That causes builders to have status-motivated blindspots around weak points in their ideas. However, the breaker doesn't have an obvious status motive. In fact, if you go around shooting down people's ideas, that's liable to annoy them, which may hurt your social status. And since most proposals are allegedly easily broken anyways, you aren't signaling any kind of special talent by shooting them down. Hence the "breaker" role ends up being undervalued/disincentivized. Especially doing anything beyond just saying "that won't work" -- finding a breaker who will describe a failure in detail instead of just vaguely gesturing seems really hard. (I don't always find such handwaving persuasive.)

I think this might be why Eliezer feels so overworked. He's staked a lot of reputation on the idea that AI alignment is a super hard problem. That gives him a unique status motive to play the red team role, which is why he's had a hard time replacing himself. I think maybe he's tried to compensate for this by making it low status to make a bad proposal, in order to browbeat people into self-critiquing their proposals. But this has a downside of discouraging the sharing of proposals in general, since it's hard to predict how others will receive your ideas. And punishments tend to be bad for creativity.

So yeah, I don't know if the tournament idea would have the immediate effect of generating deep insights. But it might motivate people to share their ideas, or generate better feedback loops, or better align overall status motives in the field, or generate a "useless" blacklist which leads to a deep insight, or filter through a large number of proposals to find the strongest ones. If tournaments were run on a quarterly basis, people could learn lessons, generate some deep ideas from those lessons, and spend a lot of time preparing for the next tournament.

A few other thoughts...

it's going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders

Perhaps we could mitigate this by allowing breakers to just characterize how something might fail in vague terms -- obviously not as good as a specific description, but still provides some signal to iterate on.

It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it's probably a worthwhile one.

I think something like a realtime Slack discussion could be pretty engaging. I think there is room for both high-deliberation and low-deliberation formats. [EDIT: You could also have a format in between, where the blue team gets little time, and the red team gets lots of time, to try to simulate the difference in intelligence between an AGI and its human operators.] Also, I'd expect even a slow, high-deliberation tournament format to be more engaging than the way alignment research often gets done (spend a bunch of time thinking on your own, write a post, observe post score, hopefully get a few good comments, discussion dies out as post gets old).

Thanks for writing this! Do you have any thoughts on doing a red team/blue team alignment tournament as described here?

Chapter 7 in this book had a few good thoughts on getting critical feedback from subordinates, specifically in the context of avoiding disasters. The book claims that merely encouraging subordinates to give critical feedback is often insufficient, and offers ideas for other things to do.

And just as I was writing this I came across another good example of the ‘you think you’re in competition with others like you but mostly you’re simply trying to be good enough’

I'm straight, so possibly unreliable, but I remember Michael Curzi as a very good-looking guy with a deep sexy voice. I believe him when he says other dudes are not competition for him 95% of the time. ;-)
