Why did you use the weak AGI question? It feels like a motte-and-bailey to say "x time until AGI" but then link to the weak AGI question.
I wonder how much COVID got people to switch to working on biorisks.
What I'm interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful when explaining the case to them.
I think asking for specific capabilities would also be interesting. Or asking what specific capabilities they would've named in 2012. Then asking how long they expect between that capability and an x-catastrophe.
I agree. You can even get career advice here at https://www.aisafetysupport.org/resources/career-coaching
Or feel free to message me for a short call. I bet you could get paid to do alignment work, so it’s worth looking into at least.
[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]
Input:
Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w... (read more)
Thanks. Yeah, this all sounds extremely obvious to me, but I may not have included such obvious-to-Logan things if I were coaching someone else.
Key things to avoid include isolating people from their friends, breaking the linguistic association of words to reality, demanding that someone change their linguistic patterns on the spot, etc - mostly things which street epistemology specifically makes harder due to the recommended techniques
Are you saying street epistemology is good or bad here? I've only seen a few videos and haven't read through the intro documents or anything.
I was talking to someone recently who talked to Yann and got him to agree with very alignment-y things, but then a couple of days later, Yann was saying very capabilities-focused things instead.
The "someone"'s theory was that Yann's incentives and environment all push towards capabilities research.
I think that everyone can see these in theory, but different people focus on different types of information (eg low level sensory information vs high level sensory information) by default.
I believe drugs or meditation can change which types of information you pay more attention to by default, temporarily or even permanently.
I've never taken drugs beyond caffeine & alcohol, but meditating makes these phenomena much easier to see. I bet you could get most people to see them if you ask them to e.g. stare at a textured surface like carpet for 2... (read more)
I understand your point now, thanks. It's:
An embedded aligned agent is desired to have properties (1), (2), and (3). But suppose (1) & (2); then (3) cannot be true. Then suppose (2) & ...
or something of the sort.
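Spelled out, I'm picturing something like:

\[
(1) \wedge (2) \Rightarrow \neg(3), \qquad (2) \wedge (3) \Rightarrow \neg(1), \qquad \dots
\]

i.e. the desired properties can't all hold at once, which gives the contradiction.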
Happy Birthday, man. I'd probably have talked to you about AI Alignment by now, and I can imagine all the circles we would go in arguing about it.
I feel like such a different person than even a few years ago, and I don't think I mean that in a "redefining myself" way or as wanting to boost my ego. I wonder how different you'd be after your startup idea.
It'd be nice to have talked to you after Ukraine was invaded, or to go see coach about it.
I’ll bring you back if I can,
Logan
I'm confused about what your point here even is. For the first part, if you're trying to say
research that gives strong arguments/proofs that you cannot solve alignment by doing X (like showing certain techniques aren't powerful enough to prove P!=NP) is also useful.
, then that makes sense. But the post didn't mention anything about that?
You said:
We cannot just rely on a can-do attitude, as we can with starting a start-up (where even if there’s something fundamentally wrong about the idea, and it fails, only a few people’s lives are impacted hard).
which I feel... (read more)
We don't have any proofs that the approaches of the referenced researchers are doomed to fail, like we have for P!=NP and what you linked.
Besides looking for different angles or ways to solve alignment, or even for strong arguments/proofs why a particular technique will not solve alignment,
... it seems prudent to also look for whether you can prove embedded misalignment by contradiction (in terms of the inconsistency of the inherent logical relations between essential properties that would need to be defined as part of the concept of embedded/implemented/compu... (read more)
It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…
I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your point that we shouldn't expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least "has a plausible story on reducing x-risk", and maybe to what's mentioned in the quote as well.
Could you link the proven part?
Jhanas seem much healthier, though I'm pretty confused imagining your setup, so I don't have much confidence. Say it works, gets past the problems of generalizing reward (eg the brain only rewards specific parts of research and not others), and avoids the downward-spiral effects of people hacking themselves; then we hopefully have people who look forward to doing certain parts of research.
If you model humans as multi-agents, it's making a certain type of agent (the "do research" one) have a stronger say in what acti... (read more)
Haha, yeah I won some sort of prize like that. I didn't know because I left to go take a break from all those meetings right before they announced it!
The better version than reward hacking I can think of is inducing a state of jhana (basically a pleasure button) in alignment researchers. For example, use Neuralink to record the brain-process of ~1000 people going through the jhanas at multiple time-steps, average those recordings in a meaningful way, and induce the resulting brainwaves in other people.
The effect is people being satiated with the feeling of happiness (like being satiated with food/water) and being more effective as a result.
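As a very rough sketch of the "average them in a meaningful way" step (toy numpy code with made-up shapes, assuming the recordings are already time-aligned to the same jhana stages and that a per-time-step average is even meaningful):

```python
import numpy as np

# Toy stand-in for the recordings: recordings[subject, time_step, channel].
# Assumes ~1000 subjects whose sessions are already aligned to the same jhana stages.
rng = np.random.default_rng(0)
recordings = rng.normal(size=(1000, 600, 64))  # 1000 subjects, 600 time-steps, 64 channels

# The simplest "meaningful average": a per-time-step, per-channel mean across subjects.
# Anything smarter (time-warping, phase alignment, outlier rejection) would replace this line.
template = recordings.mean(axis=0)  # shape: (600, 64) -- the pattern you'd try to induce
```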
Ya, I was even planning on trying:
[post/blog/paper] rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n>
Then feed that input to the model, followed by:
Planned opinion:
to see if that gives some higher-quality summaries.
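Roughly this kind of thing, as a minimal sketch (the `complete` helper and the variable names below are made-up stand-ins for whatever completion call is actually used):

```python
def complete(prompt: str) -> str:
    # Placeholder: swap in a real language-model completion call here.
    return "<model output>"

post_text = "..."  # the post/blog/paper to be summarized

# Prompt in the newsletter's format, ending right where the summary should start.
summary_prompt = (
    post_text + "\n"
    "rohinmshah karma: 100\n"
    "Planned summary for the Alignment Newsletter:\n> "
)
summary = complete(summary_prompt)

# Then append the generated summary and ask for the opinion section.
opinion_prompt = summary_prompt + summary + "\n\nPlanned opinion:\n"
opinion = complete(opinion_prompt)
```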
I'm unsure about this now. I think there may generally be way better ways to cope (eg sleeping, walks, reading a book, hanging with friends).
A different thought: clarifying the core thing you don't like about having media always on (maybe the compulsion that leads to distractedness) may make your idea easier to communicate, and the actions/plans it produces would look different. Like, I'm fine with watching a movie with a friend or playing a video game with my roommate for an hour.
A slightly different thought: setting alarms on my phone if I'm looking at m... (read more)
Sure, I'll do it as well. For me:
I think reading the book and/or trying it yourself would be very informative. You have at least until next Sunday when he reads this comment or potentially writes more.
For those with math backgrounds not already familiar with InfraBayes (maybe people share the post with their math-background friends), can there be specifics for context? Like:
If you have experience with topology, functional analysis, measure theory, and convex analysis then...
Or
You can get a good sense of InfraBayes from [this post] or [this one]
Or
A list of InfraBayes posts can be found here.
Thanks!:)
I've recently talked to students at Harvard about convincing people to work on alignment (I'm imagining cs/math/physics majors), and how that's hard because it's a little inconvenient to be convinced. There were a couple of bottlenecks here:
For both, training people to ... (read more)
Do you have a survey or are you just doing them personally?
One concern is not having well-specified problems in their specific expertise (eg we don't have mesa-optimizers specified as a problem in number theory, and it may not be useful actually), so there's an onboarding process. Or a level of vetting/trust that some of the ones picked can understand the core difficulty and go back-and-forth from formalization to actual-problem-in-reality.
Having both more ELK-like questions and set of lit reviews for each subfield would help. It'd be even better if someon... (read more)
You're right. Some people use it to mean "larger than base rates", and in this case, you're arguing that the chance of nuclear war affecting the US is much larger than it was.
I think you’re mixing up “very unlikely” and “very impactful”. I think you can still make the point that a small probability of a huge negative impact is enough to make different decisions than you normally would’ve.
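As a purely illustrative expected-value calculation (numbers made up, not estimates from the post): let $c$ be the cost of acting differently (eg leaving a major city for a while) and $C \gg c$ the harm if the bad outcome actually hits you. Then acting is worth it exactly when

\[
p \cdot C > c \quad\Longleftrightarrow\quad p > \frac{c}{C},
\]

so even a "very unlikely" $p$ of 0.1% flips the decision whenever $C$ is more than 1000x the cost of acting.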
No, "why" is correct. See the rest of the sentence:
Write out all the counter-arguments you can think of, and repeat
It's saying: assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.
This doesn't mean specification gaming is impossible, but hopefully we'd find a way to make it less likely with a sound definition of what "trust" really means
I think the interesting part of alignment is defining "trust" in a way that is robust to reward hacking/specification gaming, which has been assumed away in this post. I mentioned a pivotal act, defined as an action that has a positive impact on humanity even a billion years away, because that's the end goal of alignment. I don't see this post getting us closer to a pivotal act because, as mention... (read more)
and it is unclear what might motivate it to switch to deception
You've already mentioned it: however you measure trust (eg surveys etc.), it can be gamed. So it'll switch strategies once it can confidently game the metric.
You did mention mesa-optimizers, which could still crop up regardless of what you’re directly optimizing (because inner agents are optimizing for other things).
And how could this help us get closer to a pivotal act?
How do transcriptions typically handle images? They're pretty important for this talk. Could you embed the images in the text as it progresses?
Regarding generators of human values: say we have the gene information that encodes human cognition; what does that mean? The equivalent of a simulated human? That's the capabilities secret-sauce algorithm, right? I'm unsure if you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it.
Assume it works as a simulated person and ignore mindcrime: how do you algorithmically end up in a good-enough subset of human values (because not all human values are meta-good)... (read more)
Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?
Cultural accumulation and Google, but that's mimicking someone who's already figured it out. How about the person who first figured out, eg, crop growth? Could be the scientific method, but also just random luck which then caught on.
Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many different... (read more)
Thinking through the "vast majority of problem-space for X fails" argument: assume we have a random text generator from which we want to get a sorting algorithm:
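A rough sketch of the kind of check I have in mind (toy Python, details made up; it samples random strings and asks whether any of them happens to define a working `sort_list`):

```python
import random
import string

def is_sorting_program(src: str, trials: int = 5) -> bool:
    """Does `src` define a function `sort_list` that actually sorts?"""
    env = {}
    try:
        exec(src, env)  # almost every random string dies here with a SyntaxError
        fn = env["sort_list"]
        for _ in range(trials):
            xs = [random.randint(0, 100) for _ in range(10)]
            if fn(list(xs)) != sorted(xs):
                return False
        return True
    except Exception:
        return False

hits = sum(
    is_sorting_program("".join(random.choices(string.printable, k=80)))
    for _ in range(100_000)
)
print(f"{hits} out of 100,000 random 80-character programs sort correctly")  # almost certainly 0
```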
Any suggestions for the format in future weeks? Or a criticism of the idea in general?
I'm available for co-working to discuss any post or potential project on interpretability, or if you'd like someone to bounce ideas off of. My calendly link is here; I'm available all week at many times. I won't take more than 2 meetings in a day, but if that happens I'll email you within the day to reschedule.
Do you want to co-work? Please include your availability and a way to contact you (I personally recommend calendly).
What are research directions you want discussed? Is there a framework or specific project you think would further transparency and interpretability?
Summary & Thoughts:
Defines corrigibility as the "agent's willingness to let us change its policy w/o being incentivized to manipulate us". Separates terms to define:
Among optimal policies, the ones that let us correct the agent in the way we want are a small minority. If being "corrected" leads to more optimal policies, it's then optimal for the agent to manipulate us into "correcting" it. So we can't get strict corrigibility with... (read more)
The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that?
In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.
Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job at not coming across tha... (read more)
I've updated my meeting times to meet more this week if you'd like to sign up for a slot (link w/ a pun), and from his comment, I'm sure diffractor would also be open to meeting.
I will point out that there's a confusion in terms that I noticed in myself: corrigibility meaning either "always correctable" or "something like CEV". But we can talk that over on a call too :)
I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible.
This has actually already happened in the document with corrigible either meaning:
Fixed! Thanks:)
Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade-offs, or discuss the costs/benefits of what I'm presenting in this post.
Potential topics: what other topics besides corrigibility could we collaborate on in future weeks? Also, are we able to poll users for topics on the site?
Update: I am available this week until Saturday evening at this calendly link (though I will close the openings if a large number of people sign up). I am available all day Saturday, Dec 4th (the calendly link will show your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There's not a person here I haven't enjoyed getting to know, so do feel free to click that link and book a time!
Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (eg calendly link, “dm me”, “ my email is bob and alice dot com”, etc).
The agent could then manipulate whoever’s in charge of giving the “hand-of-god” optimal action.
I do think "reducing uncertainty" captures something relevant, and turntrout's outside-view post (huh, guess I can't make links on mobile, so here: https://www.lesswrong.com/posts/BMj6uMuyBidrdZkiD/corrigibility-as-outside-view) grounds out uncertainty as "how wrong am I about the true reward of the many different people I could be helping out?"
I don't think I understand the question. Can you rephrase?
Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand.
I'm unsure what the government could do that DeepMind or OpenAI (or someone else) couldn't do on their own. Maybe you're imagining a policy that forces all companies to build aligned AIs according to the solution, but this won't be perfect, and an unaligned AGI could still kill everyone (or it could be built somewhere else).
The first thing you do with a solution to alignment is build an aligned AGI to prevent all x-risks. I don't see how routing through the government helps that process(?)