All of Logan Riggs's Comments + Replies

The Basics of AGI Policy (Flowchart)

From my perspective, it’s more like the opposite; if alignment were to be solved tomorrow, that would give the AI policy people a fair shot at getting it implemented.

I’m unsure what the government can do that DeepMind or OpenAI (or someone else) couldn’t do on their own. Maybe you’re imagining a policy that forces all companies to build aligned AIs according to the solution, but this won’t be perfect, and an unaligned AGI could still kill everyone (or it could be built somewhere else).

The first thing you do with a solution to alignment is build an aligned AGI to prevent all x-risks. I don’t see how routing through the government helps that process(?)

The AI Countdown Clock

Why did you use the weak AGI question? Feels like a motte-and-bailey to say “x time until AGI” but then link to the weak AGI question.

1Akram Choudhary20d
Eliezer seems to think that the shift from proto-AGI to AGI to ASI will happen really fast, and many of us on this site agree with him, thus it's not sensible that there is a decade gap between "almost AI" and AI on Metaculus. If I recall, Turing (I think?) said something similar: that once we know the way to generate even some intelligence, things get very fast after that (heavily paraphrased). So 2028 really is the beginning of the end if we do really see proto-AGI then.
7River Lewis2mo
I picked it because it has the most predictions and is frequently pointed to as an indicator of big shifts. But you're right, I should work on adding an option to use the strong question instead; I can see why people might prefer that.
Frame for Take-Off Speeds to inform compute governance & scaling alignment

I wonder how much COVID got people to switch to working on biorisks.

What I’m interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful for explaining it to them.

I think asking for specific capabilities would also be interesting, or what specific capabilities they would’ve said in 2012, and then asking how long they expect between that capability and an x-catastrophe.

Gato as the Dawn of Early AGI

I agree. You can even get career advice here at https://www.aisafetysupport.org/resources/career-coaching

Or feel free to message me for a short call. I bet you could get paid to do alignment work, so it’s worth looking into at least.

Prize for Alignment Research Tasks

[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]

Task: Obvious EA/Alignment Advice

  • Context: There are several common mental motions that the EA community uses which can be usefully applied to alignment. Ex. "Differential Impact", "Counterfactual Impact", "Can you clearly tell a story on how this reduces x-risk?", and "Truly Optimizing for X". A general "obvious advice" is useful for general capabilities as well, but this i
... (read more)
5alexrjl2mo
I actually happen to already have taught elicit to give helpful/obvious advice (not alignment specific, but close enough given the examples were inspired by thinking that lots of the advice I give in my day job as an advisor is obvious)! You can play with it here [https://ide.elicit.org/run/bBTYr9CtrmKpMY3oZ] if you have an elicit account.

Edit: Here's the training data

Life problem: I need to think of a research topic but I've only had two ideas and both of them aren't that great.
Obvious but useful advice:
* Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible? This can work surprisingly well!

Life problem: I've been putting off writing this email, and now every time I think about it I feel bad and don't want to do anything at all, especially write the email!
Obvious but useful advice:
* This seems like a pretty urgent problem to solve, as it's one that will just keep getting worse otherwise as you get negative reinforcement when you think about it. I have a few ideas for things to try: Can you get a friend to sit with you while you write it, or even to write it for you? If you make it your number one priority, can you get it done right now? Is there a way you can not send it, for example by just emailing to say 'sorry, can't reply now, will explain later'?

Life problem: I'm thinking about quitting my job in finance in order to self-study ML and switch to working on alignment. How can I make the final decision?
Obvious but useful advice:
* That's an exciting decision to be making! It might be worth writing up the pros and cons of both options in a googledoc, and sharing it with some friends with comment access enabled. Getting your thoughts sorted in a way which is clear to others might be helpful itself, and then also your friends might have useful suggestions or additional considerations!

Life problem: I'm giving a talk tomorrow, but I'm worried that I
Prize for Alignment Research Tasks

Task: Steelman Alignment proposals

  • Context: Some alignment research directions/proposals have a kernel of truth to them. Steelmanning these ideas to find the best version of them may open up new research directions or, more likely, make the pivot to alignment research easier. On the latter, some people are resistant to changing their research direction, and a steelman will only slightly change the topic while focusing on maximizing impact. This would make it easier to convince these people to change to a more alignment-related direction.
  • Input Type: A general resea
... (read more)
Prize for Alignment Research Tasks

Task: Feedback on alignment proposals

  • Context: Some proposals for a solution to alignment are dead ends or have common criticisms. Having an easy way of receiving this feedback on one's alignment proposal can prevent wasted effort as well as further the conversation on that feedback.
  • Input Type: A proposal for a solution to alignment or a general research direction
  • Output Type: Common criticisms of that research direction, or arguments for why it is a dead end

Instance 1

Input:

Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w

... (read more)
Convincing People of Alignment with Street Epistemology

Thanks. Yeah this all sounds extremely obvious to me, but I may not have included such obvious-to-Logan things if I was coaching someone else.

Key things to avoid include isolating people from their friends, breaking the linguistic association of words to reality, demanding that someone change their linguistic patterns on the spot, etc - mostly things which street epistemology specifically makes harder due to the recommended techniques

Are you saying street epistemology is good or bad here? I've only seen a few videos and haven't read through the intro documents or anything.

1the gears to ascenscion3mo
Good. People have [edit: some] defenses against abusive techniques, and from what I've seen of street epistemology, its response to most of those is to knock on the front door rather than trying to sneak in the window, metaphorically speaking.
Convincing All Capability Researchers

I was talking to someone recently who talked to Yann and got him to agree with very alignment-y things, but then a couple days later, Yann was saying very capabilities things instead. 

The "someone"'s theory was that Yann's incentives and environment is all towards capabilities research.

"Mild Hallucination" Test

I think that everyone can see these in theory, but different people focus on different types of information (eg low level sensory information vs high level sensory information) by default. 

I believe drugs or meditating can change which types of information you pay more attention to by default, temporarily or even permanently. 

I've never taken drugs beyond caffeine & alcohol, but meditating makes these phenomena much easier to see. I bet you could get most people to see them if you ask them to e.g. stare at a textured surface like carpet for 2... (read more)

Productive Mistakes, Not Perfect Answers

I understand your point now, thanks. It's:

An embedded aligned agent is desired to have properties (1), (2), and (3). But suppose (1) & (2); then (3) cannot be true. Then suppose (2) & ...

or something of the sort. 

1Remmelt3mo
Yeah, that points well to what I meant. I appreciate your generous intellectual effort here to paraphrase back! Sorry about my initially vague and disagreeable comment (aimed at Adam, who I chat with sometimes as a colleague). I was worried about what looks like a default tendency in the AI existential safety community to start from the assumption that problems in alignment are solvable. Adam has since clarified with me that although he had not written about it in the post, he is very much open to exploring impossibility arguments (and sent me a classic paper on impossibility proofs in distributed computing).
Today a Tragedy

Happy Birthday Man. I’d probably have talked to you about AI Alignment by now, and can imagine all the circles we would go arguing it.

I feel like such a different person than even a few years ago, and I don’t think I mean that from a “redefining myself” way or wanting to boost my ego. I wonder how different you’d be after your startup idea.

It’d be nice to have talked to you after Ukraine being invaded, or go see coach about it.

I’ll bring you back if I can,

Logan

Productive Mistakes, Not Perfect Answers

I'm confused about what your point here even is. For the first part, if you're trying to say

research that gives strong arguments/proofs that you cannot solve alignment by doing X (like showing certain techniques aren't powerful enough to prove P!=NP) is also useful.

, then that makes sense. But the post didn't mention anything about that?

You said:

We cannot just rely on a can-do attitude, as we can with starting a start-up (where even if there’s something fundamentally wrong about the idea, and it fails, only a few people’s lives are impacted hard).

which I feel... (read more)

We don't have any proofs that the approaches the referenced researchers are taking are doomed to fail, like we have for P!=NP and what you linked.


Besides looking for different angles or ways to solve alignment, or even for strong arguments/proofs why a particular technique will not solve alignment,
... it seems prudent to also look for whether you can prove embedded misalignment by contradiction (in terms of the inconsistency of the inherent logical relations between essential properties that would need to be defined as part of the concept of embedded/implemented/compu... (read more)

Productive Mistakes, Not Perfect Answers

It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…

I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your idea that we shouldn’t expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least “has a plausible story on reducing x-risk” and maybe what’s mentioned in the quote as well.

4Joe_Collman3mo
For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers to grok these things is important. My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful. Generating a broad variety of new ideas is great, and we don't want to be too quick in throwing out those that miss the target.

The thing I'm unclear about is something like: What target(s) do I aim for if I want to generate the set of ideas with greatest value? I don't think that "Aim for full alignment solution" is the right target here. I also don't think that "Aim for wacky long-shots" is the right target - and of course I realize that Adam isn't suggesting this. (we might find ideas that look like wacky long-shots from outside, but we shouldn't be aiming for wacky long-shots) But I don't have a clear sense of what target I would aim for (or what process I'd use, what environment I'd set up, what kind of people I'd involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on).

Another disanalogy with previous research/invention... is that we need to solve this particular problem. So in some sense a history of:
[initially garbage-looking-idea] ---> [important research problem solved]
may not be relevant. What we need is:
[initially garbage-looking-idea generated as attempt to solve x] ---> [x was solved]
It's not good enough if we find ideas that are useful for something, they need to be useful for this. I expect the kinds of processes that work well to look different from those used where there's no fixed problem.
MIRI announces new "Death With Dignity" strategy

Could you link the proven part?

Jhanas seem much healthier, though I'm pretty confused imagining your setup, so I don't have much confidence. Say it works and gets past the problems of generalizing reward (eg the brain only rewards for specific parts of research and not others) and ignoring downward-spiral effects of people hacking themselves; then we hopefully have people who look forward to doing certain parts of research. 

If you model humans as multi-agents, it's making a certain type of agent (the "do research" one) have a stronger say in what acti... (read more)

1rank-biserial3mo
https://en.wikipedia.org/wiki/Brain_stimulation_reward
https://doi.org/10.1126/science.140.3565.394
https://sci-hub.hkvisa.net/10.1126/science.140.3565.394
5-Minute Advice for EA Global

Haha, yeah I won some sort of prize like that. I didn't know it because I left right before they announced to go take a break from all those meetings!

MIRI announces new "Death With Dignity" strategy

The better version than reward hacking I can think of is inducing a state of jhana (basically a pleasure button) in alignment researchers. For example, use Neuralink to get the brain-process of ~1000 people going through the jhanas at multiple time-steps, average them in a meaningful way, and induce those brainwaves in other people. 

The effect is people being satiated with the feeling of happiness (like being satiated with food/water) and becoming more effective as a result.

0rank-biserial3mo
* The "electrode in the reward center" setup has been proven to work in humans, whereas jhanas may not tranfer over Neuralink. * Deep brain stimulation is FDA-approved in humans, meaning less (though nonzero) regulatory fuckery will be required. * Happiness is not pleasure; wanting is not liking. We are after reinforcement.
A survey of tool use and workflows in alignment research

Ya, I was even planning on trying:

[post/blog/paper] rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n> 

Then feed that input to:

Planned opinion:

to see if that has some higher-quality summaries. 
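
A minimal sketch of how that two-stage prompt could be assembled (my illustration, with hypothetical helper names; `complete` stands in for whatever text-completion model is being used):

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to a text-completion model (hypothetical)."""
    raise NotImplementedError

def newsletter_style_summary(post_text: str):
    # Stage 1: prime the model to imitate Alignment Newsletter summaries.
    summary_prompt = (
        post_text + "\n"
        "rohinmshah karma: 100\n"
        "Planned summary for the Alignment Newsletter:\n> "
    )
    summary = complete(summary_prompt)

    # Stage 2: feed the same context plus the generated summary,
    # then ask for an opinion in the same style.
    opinion_prompt = summary_prompt + summary + "\n\nPlanned opinion:\n> "
    opinion = complete(opinion_prompt)
    return summary, opinion
```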

Do a cost-benefit analysis of your technology usage

I'm unsure about this now. I think there may generally be way better ways to cope (eg sleeping, walks, reading a book, hanging with friends). 

A different thought: Clarifying the core thing you don't like about having media always on (maybe the compulsion that leads to distractedness) may make your idea easier to communicate and look different in actions/plans produced. Like I'm fine with watching a movie with a friend or playing a video game with my roommate for an hour. 

A slightly different thought: setting alarms on my phone if I'm looking at m... (read more)

Do a cost-benefit analysis of your technology usage

Sure, I'll do it as well. For me: 

  • 1pm - Check snaps, messages, emails, discord team messages and accelerating alignment channel, & EAGx messages. Hard limit at 1:30pm (just set two alarms on my phone)
  • Whitelist - Roam/research-related searches. Phone calls/texts are unblocked from a certain set of people, who I've told can reach me there (I set them in the emergency contact list), but besides that, my iPhone is in "focus mode" with all notifications hidden. 
  • Exception handling: I don't think I'll need one, but I can let my roommate know and
... (read more)
2TurnTrout3mo
(I think everyone should have well-defined exception handling, because some of you will have crazy shit happen, like "someone died", and that can make it hard if you're pondering "do I let myself have an allowance now?". Failing to plan is planning to fail (in not-wholly-improbable worlds).)
Do a cost-benefit analysis of your technology usage

I think reading the book and/or trying it yourself would be very informative. You have at least until next Sunday when he reads this comment or potentially writes more.

2TurnTrout3mo
:)
Job Offering: Help Communicate Infrabayesianism

For those with math backgrounds not already familiar with InfraBayes (maybe people share the post with their math-background friends), can there be specifics for context? Like:

If you have experience with topology, functional analysis, measure theory, and convex analysis then...

Or

You can get a good sense of InfraBayes from [this post] or [this one]

Or

A list of InfraBayes posts can be found here.

Some (potentially) fundable AI Safety Ideas

Thanks!:)

I’ve recently talked to students at Harvard about convincing people about alignment (I’m imagining CS/math/physics majors) and how that’s hard because it’s a little inconvenient to be convinced. There were a couple of bottlenecks here:

  1. There are ~80 people signed up for a “coffee with an x-risk person” talk but only 5 very busy people who are competent enough to give those one-on-ones.
  2. There are many people who have friends/roommates/classmates, but don’t know how to approach the conversation or do it effectively.

For both, training people to ... (read more)

Some (potentially) fundable AI Safety Ideas

Do you have a survey or are you just doing them personally?

One concern is not having well-specified problems in their specific expertise (eg we don't have mesa-optimizers specified as a problem in number theory, and it may not actually be useful), so there's an onboarding process. Or a level of vetting/trust that some of the ones picked can understand the core difficulty and go back and forth from formalization to actual-problem-in-reality.

Having both more ELK-like questions and a set of lit reviews for each subfield would help. It'd be even better if someon... (read more)

2Algon4mo
I'm devising the survey and thinking about how to approach these people. My questions would probably be of the form: "How much would it take for you to attend a technical workshop?", "How much to take a sabbatical to work in a technical field?", "How much for you to spend X amount of time on problem Y?"

Yes, we do need something they can work on. That's part of what makes the survey tricky, because I expect that "work on problem X which is relevant to your field" vs "work on problem Y that you know nothing about, and attend a workshop to get you up to speed" would result in very different answers. And knowing what questions to ask them requires a fair bit of background knowledge in AI safety and the mathematician's subfield, so this limits the pool of people that can sensibly work on this. Which is why trying to parallelise things and perhaps set up a group where we can discuss targets and how to best approach them would be useful. I'd also like to be able to contact AI safety folks on who they think we should go after, and which problems we should present, as well as perhaps organising some background reading for these guys, as we want to get them up to speed as quickly as possible.
Higher Risk of Nuclear War

You’re right. Some people use it to mean “larger than base rates”, and in this case, you’re arguing that the chance of nuclear war affecting the US is much larger than it was.

Higher Risk of Nuclear War

I think you’re mixing up “very unlikely” and “very impactful”. I think you can still make the point that a small probability of a huge negative impact is enough to make different decisions than you normally would’ve.

4adamzerner4mo
I actually disagree. Thinking about a raw number like 0.1%, what determines whether it is considered big or small? I think the answer is the context. 0.1% is small if we're talking about the chances that a restaurant gets your order wrong, but big if we're talking about the chances that you win the lottery, I think.
How I Formed My Own Views About AI Safety

No, "why" is correct. See the rest of the sentence:

Write out all the counter-arguments you can think of, and repeat
 

It's saying: assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.

4shminux4mo
I understand the approach, but this is about finding an accurate model, not about Talmud-style creating and demolishing various arguments against the faith. The questionable framing is as opposed to, say, listing top 10 potential "most important problems to work on", whether related to X-risk or not, and trying to understand what makes a problem "most important" and under what assumptions.
Trust-maximizing AGI

This doesn't mean specification gaming is impossible, but hopefully we'd find a way to make it less likely with a sound definition of what "trust" really means

I think the interesting part of alignment is in defining "trust" in a way that goes against reward hacking/specification gaming, which has been assumed away in this post. I mentioned a pivotal act, defined as an action that has a positive impact on humanity even a billion years away, because that's the end goal of alignment. I don't see this post getting us closer to a pivotal act because, as mention... (read more)

1Karl von Wendt4mo
Thank you! You're absolutely right, we left out the "hard part", mostly because it's the really hard part and we don't have a solution for it. Maybe someone smarter than us will find one.
Trust-maximizing AGI

and it is unclear what might motivate it to switch to deception

You’ve already mentioned it: however you measure trust (eg surveys etc) can be gamed. So it’ll switch strategies once it can confidently game the metric.

You did mention mesa-optimizers, which could still crop up regardless of what you’re directly optimizing (because inner agents are optimizing for other things).

And how could this help us get closer to a pivotal act?

1Karl von Wendt4mo
These are valid concerns. If we had a solution to them, I'd be much more relaxed about the future than I currently am. You're right, in principle, any reward function can be gamed. However, trust as a goal has the specific advantage of going directly against any reward hacking, because this would undermine "justified" long-term trust. An honest strategy simply forbids any kind of reward hacking. This doesn't mean specification gaming is impossible, but hopefully we'd find a way to make it less likely with a sound definition of what "trust" really means. I'm not sure what you mean by a "pivotal act". This post certainly doesn't claim to be a solution to the alignment problem. We just hope to add something useful to the discussion about it.
The Big Picture Of Alignment (Talk Part 1)

How do transcriptions typically handle images? They're pretty important for this talk. You could embed the images in the text as it progresses?

The Big Picture Of Alignment (Talk Part 1)

Regarding generators of human values: say we have the gene information that encodes human cognition, what does that mean? The equivalent of a simulated human? The capabilities secret-sauce algorithm, right? I'm unsure if you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it.

Assuming it works as a simulated person and ignoring mindcrime, how do you algorithmically end up in a good enough subset of human values (because not all human values are meta-good)... (read more)

The Big Picture Of Alignment (Talk Part 1)

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

Cultural accumulation and Google, but that's mimicking someone who's already figured it out. How about the person who first figured out eg crop growth? Could be the scientific method, but also just random luck which then caught on. 

Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many diffe

... (read more)
The Big Picture Of Alignment (Talk Part 1)

Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator that we want to run a sorting algorithm:

  • The vast majority don't sort (or even compile)
  • The vast majority of programs that "look like they work" don't (eg "forgot a semicolon", "didn't account for an already-sorted list", etc.)
  • Generalizing: the vast majority of programs that pass [unit tests, compiles, human says "looks good to me", simple] don't work. 
    • They could be incomprehensible, pass several unit tests, but still fail on weird edge cases
... (read more)
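
As a hypothetical illustration of that last point (my example, not from the original comment): a "sort" that quietly drops duplicates compiles, passes casual unit tests, and looks fine on review, yet is wrong on a large slice of the input space.

```python
def broken_sort(xs):
    # Looks plausible and passes the casual checks below,
    # but silently drops duplicate elements.
    return sorted(set(xs))

# Unit tests whose inputs happen to be duplicate-free all pass.
assert broken_sort([3, 1, 2]) == [1, 2, 3]
assert broken_sort([]) == []
assert broken_sort([10, -5, 7]) == [-5, 7, 10]

# Yet the program is wrong on any input with repeats.
print(broken_sort([2, 2, 1]))  # prints [1, 2] -- an element has vanished
```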
Solving Interpretability Week

Any suggestions for the format in future weeks? Or a criticism of the idea in general?

Solving Interpretability Week

I'm available for co-working to discuss any post or potential project on interpretability or if you'd like someone to bounce ideas off of. My calendly link is here, I'm available all week at many times, and I won't take more than 2 meetings in a day, but I'll email you within the day to reschedule if that happens.

Solving Interpretability Week

Do you want to co-work? Please include your availability and way to contact you (I personally recommend calendly)

1Evan R. Murphy7mo
I'm interested in trying a co-work call sometime but won't have time for it this week. Thanks for sharing about Shay in this post. I had not heard of her before, what a valuable resource/way she's helping the cause of AI safety. (As for contact, I check my LessWrong/Alignment Forum inbox for messages regularly.)
1Logan Riggs7mo
I'm available for co-working to discuss any post or potential project on interpretability or if you'd like someone to bounce ideas off of. My calendly link is here [https://calendly.com/elriggs/chat?back=1&month=2021-12], I'm available all week at many times, and I won't take more than 2 meetings in a day, but I'll email you within the day to reschedule if that happens.
Solving Interpretability Week

What are research directions you want discussed? Is there a framework or specific project you think would further transparency and interpretability? 

Corrigibility Can Be VNM-Incoherent

Summary & Thoughts:

Defines corrigibility as “the agent’s willingness to let us change its policy w/o being incentivized to manipulate us”. Separates terms to define:

  1. Weakly-corrigible to policy change pi - there exists an optimal policy under which not disabling correction is optimal.
  2. Strictly-corrigible - all optimal policies avoid disabling correction.

Across optimal policies, those that let us correct the agent in the way we want are a small minority. If correcting leads to more optimal policies, it’s then optimal for the agent to manipulate us into “correcting it”. So we can’t get strict-corrigibility with... (read more)
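
A compact way to state the two notions above (my own notation, not necessarily the post's):

```latex
% Assumed notation: \Pi^*(R) is the set of policies optimal for reward function R,
% and "disables correction" means the policy takes the action that turns off the
% correction mechanism.
\text{Weakly corrigible:}\quad \exists\, \pi \in \Pi^*(R)\ \text{such that } \pi \text{ never disables correction}
\text{Strictly corrigible:}\quad \forall\, \pi \in \Pi^*(R),\ \pi \text{ never disables correction}
```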

Solve Corrigibility Week

The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that?

In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.

Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job at not coming across tha... (read more)

1Koen.Holtman7mo
Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.

You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of the above three examples represent some form of progress, but you and I are not the entire community here; Yudkowsky is also part of it.

On the last one of your three examples, I feel that 'mesa optimizers' is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things; the problem is that many do not seem very interested in ever using other people's definitions or models as a frame of reference. They'd rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about.

I am sensing from your comments that you believe that, with more hard work and further progress on understanding alignment, it will in theory be possible to make this community agree, in future, that certain alignment problems have been solved. I, on the other hand, do not believe that it is possible to ever reach that state of agreement in this community, because the debating rules of philosophy apply here. Philosophers are always allowed to disagree based on strongly held intuitive beliefs that they cannot be expected to explain any further. The type of agreement you seek is only possible in a sub-community which is willing to use more strict rules of
Solve Corrigibility Week

I've updated my meeting times to meet more this week if you'd like to sign up for a slot (link w/ a pun), and from his comment, I'm sure Diffractor would also be open to meeting. 

I will point out that there's a confusion in terms that I noticed in myself, of corrigibility meaning either "always correctable" or "something like CEV", though we can talk that over on a call too :)

1plex7mo
Cool, booked a call for later today.
Solve Corrigibility Week

I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible.

This has actually already happened in the document with corrigible either meaning:

  1. Correctable all the time regardless
  2. Correctable up until the point where the agent actually knows how to achieve your values better than you (related to intent alignment and
... (read more)
2Koen.Holtman7mo
Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress. The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site have absolutely no mechanism for agreeing among themselves whether a problem has been solved, or whether some sub-group has made meaningful progress on it.

To make the same point in another way: the forces which introduce disagreeing viewpoints and linguistic entropy [https://www.lesswrong.com/posts/MiYkTp6QYKXdJbchu/disentangling-corrigibility-2015-2021#Linguistic_entropy] to this forum are stronger than the forces that push towards agreement and clarity.

My thinking about how strong these forces are has been updated recently, by the posting of a whole sequence of Yudkowsky conversations [https://www.lesswrong.com/s/n945eovrA3oDueqtq] and also this one [https://www.lesswrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions]. In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read. I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.
Solve Corrigibility Week
  • Google docs is kind of weird because I have to trust people won't spam suggestions. I also may need to keep up with allowing suggestions on a consistent basis. I would want this hosted on LW/AlignmentForum, but I do really like in-line commenting and feeling like there's less of a quality-bar to meet. I'm unsure if this is just me.
  • Walled Garden group discussion block time: have a block of ~4-16 hours using Walled Garden software. There could be a flexible schedule with schelling points to coordinate meeting up. For example, if someone wants to give a talk
... (read more)
Solve Corrigibility Week

Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade offs or discuss the cost/benefits of what I’m presenting in this post.

1Logan Riggs7mo
* Google docs is kind of weird because I have to trust people won't spam suggestions. I also may need to keep up with allowing suggestions on a consistent basis. I would want this hosted on LW/AlignmentForum, but I do really like in-line commenting and feeling like there's less of a quality-bar to meet. I'm unsure if this is just me.
* Walled Garden group discussion block time: have a block of ~4-16 hours using Walled Garden software. There could be a flexible schedule with Schelling points to coordinate meeting up. For example, if someone wants to give a talk on a specific corrigibility research direction and get live feedback/discussion, they can schedule a time to do so.
* Breaking up the task comment. Technically the literature review, summaries, extra thoughts is a "task" to do. I do want broken down tasks that many people could do, though what may end up happening is whoever wants a specific task done ends up doing it themselves. Could also have "possible research directions" as a high-level comment.
Solve Corrigibility Week
  • Timelines and forecasting 
  • Goodhart’s law
  • Power-seeking
  • Human values
  • Learning from human feedback
  • Pivotal actions
  • Bootstrapping alignment 
  • Embedded agency 
  • Primer on language models, reinforcement learning, or machine learning basics 
    • This one’s not really on-topic, but I do see value in a more “getting up to date” focus where experts can give talks or references to learn things (eg “here’s a tutorial for implementing a small GPT-2”). Though I could just periodically ask LW questions on whatever topic ends up interesting me at the moment. Though,
... (read more)
Solve Corrigibility Week

Potential topics: what other topics besides corrigibility could we collaborate on in future weeks? Also, are we able to poll users for topics on the site?

3Logan Riggs7mo
* Timelines and forecasting
* Goodhart's law
* Power-seeking
* Human values
* Learning from human feedback
* Pivotal actions
* Bootstrapping alignment
* Embedded agency
* Primer on language models, reinforcement learning, or machine learning basics
  * This one's not really on-topic, but I do see value in a more "getting up to date" focus where experts can give talks or references to learn things (eg "here's a tutorial for implementing a small GPT-2"). Though I could just periodically ask LW questions on whatever topic ends up interesting me at the moment. Though, I could do my own Google search, but I feel there's some community value here that won't be gained. Like learning and teaching together makes it easier for the community to coordinate in the future. Plus connections bonuses.
Solve Corrigibility Week

Update: I am available this week until Saturday evening at this calendly link (though I will close the openings if a large number of people sign up). I am available all Saturday Dec 4th (the calendly link will allow you to see your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There's not a person here I haven't enjoyed getting to know, so do feel free to click that link and book a time!

Solve Corrigibility Week

Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (eg calendly link, “dm me”, “ my email is bob and alice dot com”, etc).

2Diffractor7mo
Availability: Almost all times between 10 AM and PM, California time, regardless of day. Highly flexible hours. Text over voice is preferred, I'm easiest to reach on Discord. The LW Walled Garden can also be nice.
1Logan Riggs7mo
Update: I am available this week until Saturday evening at this calendly link [https://calendly.com/elriggs/chat] (though I will close the openings if a large number of people sign up). I am available all Saturday Dec 4th [https://calendly.com/elriggs/solving-corrigibility-day] (the calendly link will allow you to see your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There's not a person here I haven't enjoyed getting to know, so do feel free to click that link and book a time!
Corrigibility Can Be VNM-Incoherent

The agent could then manipulate whoever’s in charge of giving the “hand-of-god” optimal action.

I do think the “reducing uncertainty” framing captures something relevant, and TurnTrout’s outside-view post (huh, guess I can’t make links on mobile, so here: https://www.lesswrong.com/posts/BMj6uMuyBidrdZkiD/corrigibility-as-outside-view) grounds out uncertainty to be “how wrong am I about the true reward of many different people I could be helping out?”

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

I don't think I understand the question. Can you rephrase?

Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand. 
