All of adamShimi's Comments + Replies

Shapley values are very cool. Let me mention some cool facts:

They arise in (cooperative) game theory, but also in ML when doing credit allocation for a combined prediction that mixes predictions from different modules of a system.

One piece of evidence for how fundamental they are is that they arise naturally from Hodge theory on the hypercube of a coalition game: https://arxiv.org/abs/1709.08318

Another interesting fact I learned from Davidad: Shapley values are not compositional: a group of actors can increase their total Shapley value by forming a single caba... (read more)
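To make the definition concrete, here is a self-contained sketch that computes Shapley values for a small coalition game by averaging each player's marginal contribution over all orderings of the players (the example game is invented for illustration):

```python
from itertools import permutations

def shapley_values(players, v):
    """Shapley value of each player in a coalition game.

    `v` maps a frozenset of players to that coalition's worth.
    Averages each player's marginal contribution over all orderings
    in which the grand coalition can be assembled.
    """
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining in this order.
            totals[p] += v(with_p) - v(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Invented example: any coalition of 2+ players is worth 1, singletons 0.
v = lambda s: 1.0 if len(s) >= 2 else 0.0
vals = shapley_values(["a", "b", "c"], v)
# The game is symmetric, so each player gets 1/3, and the values
# sum to v(grand coalition) = 1 (the "efficiency" axiom).
```

This brute-force enumeration is exponential in the number of players, which is why practical credit-allocation tools approximate Shapley values by sampling orderings instead.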

Oh, didn't know him!

Thanks for the links!

Thanks for the comment!

I agree with you that there are situations where the issue comes from a cultural norm rather than psychological problems. That's one reason for the last part of this post, where we point to generally positive and productive norms that try to avoid these cultural problems and make it possible to discuss them. (One of the issues I see in my own life with cultural norms is that they are way harder to discuss when psychological problems compound them and make them feel sore and emotional.) But you might be right that it's ... (read more)

I think I agree with everything in your comment. Seems like there was less disagreement here than I initially thought. Moving on... :)

Oh, I definitely agree, this is a really good point. What I was highlighting was an epistemic issue (namely the confusion between ideal and necessary conditions) but there is also a different decision theoretic issue that you highlighted quite well.

It's completely possible that you're not powerful enough to work outside the ideal condition. But by doing the epistemic clarification, we can now consider the explicit decision of taking steps to become more powerful and getting better at managing non-ideal conditions.

Good point! The difference is that the case explained in this post is one of the most sensible versions of confusing the goal and the path, since there the path is actually a really good path. In the other version (like wanting to find a simple theory simply), the path is not even a good one!

In many ways, this post is frustrating to read. It isn't straightforward, it needlessly insults people, and it mixes irrelevant details with the key ideas.

And yet, as with many of Eliezer's posts, its key points are right.

What this post does is uncover the main epistemological mistakes made by almost everyone trying their hand at figuring out timelines. Among others, there are:

  • Taking arbitrary guesses within a set of options that you don't have enough evidence to separate
  • Piling arbitrary assumption on arbitrary assumption, leading to completely uninforma
... (read more)

I was mostly thinking of the efficiency assumption underlying almost all the scenarios. Critch assumes that a significant chunk of the economy always can and does make the most efficient change (everyone's job being replaced, automated regulations replacing banks when they can't move fast enough). This neglects many potential factors, like big economic actors not having to be efficient for a long time, backlash from customers, and in general all the factors making economic actors and markets less than efficient.

I expect that most of these factors could be addressed with more work on the scenarios.

I consider this post one of the most important ever written on issues of timelines and AI doom scenarios. Not because it's perfect (some of its assumptions are unconvincing), but because it highlights a key aspect of AI Risk and the alignment problem which is so easy to miss coming from a rationalist mindset: it doesn't require an agent taking over the whole world. It is not about agency.

What RAAPs show instead is that even in a purely structural setting, where agency doesn't matter, these problems still crop up!

This insight was already present in Drexle... (read more)

7 · Daniel Kokotajlo · 5mo
Just registering my own disagreement here -- I don't think it's a key aspect, because I don't think it's necessary; the bulk of the problem IS about agency & this post encourages us to focus on the wrong problems. I do agree that this post is well written and that it successfully gives proofs of concept for the structural perspective, for there being important problems that don't have to do with agency, etc. I just think that the biggest problems do have to do with agency & this post is a distraction from them. (My opinion is similar to what Kosoy and Christiano said in their comments)
2 · Raemon · 5mo
I'd be interested in which bits strike you as notably "imperfect."

I agree that a lot of science relies on predictive hallucinations. But some counterexamples come to mind, notably the sort of phenomenological compression pushed by Faraday and (early) Ampère in their initial exploration of electromagnetism. What they did amounted to varying a lot of the experimental conditions and relating outcomes and phenomena to each other, without directly assuming any hidden entity. (see this book for more details)

More generally, I expect most phenomenological laws to not rely heavily on predictive hallucinations, even when they integra... (read more)

So reification means "the act of making real" in most English dictionaries (see here for example). That's the meaning we're trying to evoke here: the reification bias amounts to first postulating some underlying entity that explains the phenomena (that's merely a modelling technique), and second ascribing reality to this entity and manipulating it as if it were real.

You use the analogy with sports betting multiple times in this post. But science and sports are disanalogous in almost all the relevant ways!

Notably, sports are incredibly limited and well-defined, with explicit rules that literally anyone can learn, quick feedback signals, and unambiguous results. Completely the opposite of science!

The only way I see for the analogy to hold is by defining "science" in a completely impoverished way, one that puts aside most of what science actually looks like. For example, replication is not that big a part of science, it's ju... (read more)

2 · Vaniver · 5mo
I mean, the hope is mostly to replace the way that scientists communicate / what constitutes a "paper" / how grantmaking sometimes works. I agree that most of "science" happens someplace else! Like, I think for the typical prediction market, people with questions are dramatically underestimating the difficulty in operationalizing things such that they can reliably resolve it as 'yes' or 'no'. But most scientists writing papers are already running into and resolving that difficulty in a way that you can easily retool.

Agreed! That's definitely an important point, and one reason why it's still interesting to try to prove P ≠ NP. The point I was making here was only that when proofs are used for the "certainty" they give, then strong evidence from other means is also enough to rely on the proposition.

2 · Valentine · 6mo
Yep, agreed. I thought it was a very clever point!

What are you particularly interested in? I expect I could probably write it with a bit of rereading.

4 · johnswentworth · 6mo
In terms of methodology, epistemology, etc, what did you do right/wrong? What advice would you today give to someone who produced something like your old goal-deconfusion work, or what did your previous self really need to hear?

Hot take: I would say that most optimization failures I've observed in myself and in others (in alignment and elsewhere) boil down to psychological problems.

2 · Dagon · 6mo
This is likely true.  Note that there is an asymmetry between type 1 and type 2 errors in cooperative optimization, though.  The difference between "ok" and "great" is often smaller than the difference between "ok" and "get defected against when I cooperated".  In other words, some things that seem like a prisoners' dilemma actually ARE risky.

Completely agree! The point is not that formalization or axiomatization is always good, but rather to elucidate one counterintuitive way in which it can be productive, so that we can figure out when to use it.

1 · β-redex · 6mo
Also I just found that you already argued this in an earlier post [https://www.lesswrong.com/posts/pEB3LrNxvMKFLGBSG/traps-of-formalization-in-deconfusion], so I guess my point is a bit redundant. Anyway, I like that this article comes with an actual example, we could probably use more examples/case studies for both sides of the argument.

Thanks for your thoughtful comment!

First, I want to clarify that this is obviously not the only function of formalization. I feel like this might clarify a lot of the point you raise.

But first, the very idea that formalization would have helped discover non-Euclidean geometries earlier seems to run counter to the empirical observation that Euclid himself formalized geometry with 5 postulates; how much more formal can it get? Compared to the rest of the science of the time, it was a huge advance. He also saw that the 5th one did not fit neatly with the rest. Moreover,

... (read more)

That definitely feels right, with a caveat that is dear to Bachelard: this is a constant process of rectification that repeats again and again. There is no ending, or the ending is harder to find than we think.

I'm confused by your confusion, given that I'm pretty sure you understand the meaning of cognitive bias, which is quite explicitly the meaning of bias drawn upon here.

A cognitive bias, according to the page you link to, is "a systematic pattern of deviation from norm or rationality in judgment".

This article is not, as I understand it, proposing that human general intelligence is built by piling up deviations from rationality. It is proposing that human general intelligence is built by piling up rules of thumb that "leverage [] local regularities". I agree with Steven: those are heuristics, not biases. The heuristic is the thing you do. The bias is the deviation from rationality that results. It's plausible that in some ... (read more)

Thanks for your comment!

Actually, I don't think we really disagree. I might have just not made my position very clear in the original post.

The point of the post is not to say that these activities are not often valuable, but instead to point out that they can easily turn into "To do science, I need to always do [activity]". And what I'm getting from the examples is that in some cases, you actually don't need to do [activity]. There's a shortcut, or maybe just you're in a different phase of the problem.

Do you think there is still a disagreement after this clarification?

1 · Linda Linsefors · 7mo
I think we are in agreement. I think the confusion is because it is not clear from that section of the post whether you are saying 1) "you don't need to do all of these things" or 2) "you don't need to do any of these things". Because I think 1 goes without saying, I assumed you were saying 2. Also, 2 probably is true in rare cases, but this is not backed up by your examples.

But if 1 doesn't go without saying, then this means that a lot of "doing science" is cargo-culting? Which is sort of what you are saying when you talk about cached methodologies. So why would smart, curious, truth-seeking individuals use cached methodologies? Do I do this?

Some self-reflection: I did some of this as a PhD student, because I was new, and it was a way to hit the ground running. So I did some science using the method my supervisor told me to use, while simultaneously working to understand the reason behind this method. I spent less time than I would have wanted understanding all the assumptions of the sub-sub-field of physics I was working in, because of the pressure to keep publishing and because I got carried away by various fun math I could do if I just accepted these assumptions. After my PhD I felt that if I was going to stay in physics, I wanted to take a year or two just for learning, to actually understand Loop Quantum Gravity and all the other competing theories, but that's not how academia works unfortunately, which is one of the reasons I left. I think that the foundation of good epistemics is to not have competing incentives.

In a limited context, the first example that comes to me is high performers in competitive sports and games. Because if they truly only give a shit about winning (and the best generally do), they will throw away their legacy approaches when they find a new one, however much it pains them.

Thanks for the kind words!

I'm not aware of any such statistics, but I'm guessing that MATS organizers might have some.

I interpret Alex as arguing that there are not just two difficulties versus one, but an additional difficulty. From this perspective, having two will be more of an issue than one, because you have to address strictly more things.

This makes me wonder, though, if there is not some sort of direction question underlying the debate here. Because if you assume the "difficulties" are only positive numbers, then if the difficulty for the direct instillation is  and the one for the grader optimization is ... (read more)

2 · Rohin Shah · 7mo
Two responses:
  1. Grader-optimization has the benefit that you don't have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
  2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems because fundamentally they are shadows of the same problem, so I'd estimate d_evaluation at close to 0 in your equations after you have dealt with d_instillation.
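The additive picture being debated in this thread can be sketched as a toy model (all numbers and variable names here are invented for illustration; only the structure of the argument comes from the comments):

```python
# Toy model: treat each "difficulty" as a positive cost that must be paid.
d_instillation = 1.0  # difficulty of getting the intended target into the agent
d_evaluation = 0.8    # difficulty of hardening the grader against adversarial plans

# One framing: grader-optimization pays both costs, so it is strictly harder.
cost_values_executor = d_instillation
cost_grader_optimizer = d_instillation + d_evaluation
assert cost_grader_optimizer > cost_values_executor

# The reply amounts to claiming d_evaluation shrinks to ~0 once
# instillation is solved, which collapses the gap between the two.
d_evaluation_after = 0.0
assert d_instillation + d_evaluation_after == cost_values_executor
```

The sketch makes the crux explicit: the comparison hinges entirely on how large d_evaluation remains after d_instillation is paid, not on counting the number of difficulties.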

Thanks for taking time to answer my questions in detail!

About your example for other failure modes

Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?

If so, it's not clear to me why this is valuable for the actor to do? How is it supposed to get better reward from confusion only? If it has another agenda (making paperclips instead of diamonds for example), then the obfuscation is clearly valuable to ... (read more)

2 · TurnTrout · 7mo
No, the point is that the grader can only grade the current plan; it doesn't automatically know what its counterfactual branches output. The grader is scope-limited to its current invocation. This makes consistent grading harder (e.g. the soup-kitchen plan vs political activism, neither invocation knows what would be given by the other call to the grader, so they can't trivially agree on a consistent scale).

It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.

Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternatives approaches:

  • You need to know that you're going to reach an AGI before it becomes superintelligent, or you'll waste your time training an AI that will be taken over by the competitors. Whereas many approaches don't require
... (read more)
1 · Thane Ruthenis · 7mo
The crux is likely in a disagreement about which approaches we think are viable. In particular: what are the approaches you have in mind that are both promising and don't require this?

The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more [https://www.lesswrong.com/posts/heXcGuJqbx3HBmero/people-care-about-each-other-even-though-they-have-imperfect?commentId=rZPPm7SpGtsQvEvCC] skeptical [https://www.lesswrong.com/posts/kmpNkeqEGvFue7AvA/value-formation-an-overarching-model#9__Implications_for_Alignment] of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's latest post yet, so I may be wrong on that).

The core issue, as I see it, is that we'll need to aim the AI at humans in some precise way — tell it to precisely translate for us, or care about us in some highly specific way, or interpret commands in the exact way humans intend them, or figure out how to point it directly at the human values, or something along those lines. Otherwise it doesn't handle capability jumps well, whether we crank it up to superintelligence straight away or try to carefully steer it along. And the paradigm of loss functions and broad regularizers (e.g., speed/complexity penalties) seems to consist of tools too crude for this purpose. The way I see it, we'll need fine manipulation.

Since writing the original post, I've been trying to come up with convincing-to-me ways to side-step this problem (as I allude to at the post's end), but no idea so far.

Yeah, that's a difficulty unique to this approach.

The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:

  • Do you think that you are internally trying to approximate your own ?
  • Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)?
  • Can you think of concrete instances where you improved your own Eval?
  • Ca
... (read more)

> This includes “What would this specific and superintelligent CEV-universe-simulation say about this plan?”.

> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]

But isn't 1 here at least as good as 2, since the CEV-universe-simulation could always compute X=[the program that would be recommended by AGI designers in an altruistic and

... (read more)
  1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
  2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
  3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
  4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.

O... (read more)

4 · Rohin Shah · 7mo
Sounds right. How does this answer my point 4? I guess maybe you see two discrepancies vs one and conclude that two is worse than one? I don't really buy that, seems like it depends on the size of the discrepancies. For example, if you imagine an AI that's optimizing for my evaluation of good, I think the discrepancy between "Rohin's directly instilled goals" and "Rohin's CEV" is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I'd conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between "plans Rohin evaluates as good" and "Rohin's directly instilled goals".

A few questions to better understand your frame:

  • You mostly mention two outcomes for the various diamond-maximizer architectures: maximizing the number of diamonds produced and creating hypertuned-fooling-plans for the evaluator. If I could magically ensure that plan-space only contains plans that are not hypertuned-fooling-plans (they might try, but will most likely be figured out), would you say that then grader-optimization gives us an aligned AI? Or are there other failure modes that you see?
    • Intuitively if maximizing the number of diamonds and maximizi
... (read more)
4 · TurnTrout · 7mo
Really appreciate the good questions! No, there are other failure modes due to unnaturalness. Here's something I said in private communication:

So, clarification: if I (not a grader-optimizer) wanted to become a grader-optimizer while pursuing my current goals, I'd need to harden my own evaluation procedures to keep up with my plan-search now being directed towards adversarial plan generation. Furthermore, for a given designer-intended task (e.g. "make diamonds"), to achieve that with grader-optimization, the designer pays in the extra effort they need to harden the grader, relative to just... not evaluating adversarial plans to begin with. Given an already pointed-to/specified grader, the hardening is already baked into that grader, and so both evaluation- and values-child should come out about the same in terms of compute usage.

I think that a values-executing AGI can also search over as many plans which actually make sense; I don't think its options are limited or anything. But it'll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. "don't think about basilisks").

  1. I don't see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide. EDIT: I'm most confident in this point if you want your AI to propose plans which you can't generate but can maybe verify.
  2. See the last 10+ years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.

Thanks for the kind words!

  1. Are there any particular lessons/ideas from Refine that you expect (or hope) SERI MATS to incorporate?

I have shared some of my models related to epistemology and key questions with MATS organizers, and I think they're supposed to be integrated in one of the future programs. Mostly things regarding realizing the importance of productive mistakes in science (which naturally pushes back a bit against the mentoring aspect of MATS) and understanding how much less "clean" most scientific progress actually looks historically (with a basic read... (read more)

Thanks for explicitly writing out your thoughts in a place where you can expect strong pushback! I think this is particularly valuable.

That being said, while I completely agree with your second point (I keep telling people who argue theory cannot work that barely 10 people worked on it for 10 years, which is a ridiculously small number), I feel like your first point is missing some key reflections on the asymmetry of capabilities vs alignment.

I don't have time to write a long answer, but I already have a post going in depth into many of the core assumpt... (read more)

This post is amazing. Not just good, but amazing. You manage to pack exactly the lesson I needed to hear with just the right amount of memes and cheekiness to also be entertaining.

I would genuinely not be surprised if the frame in this post (and the variations I'm already adding to it) proved one of the key causal factors in me being far more productive and optimizing as an alignment researcher.

One suggestion: let's call these trees treeory of change, because that's what they are. ;)

Thanks. Really.

That all makes me very happy to hear. Happier than I remember being in a long time.

Thanks for the kind words and useful devil's advocate! (I'm expecting nothing less from you ;p)

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a Pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
    1. It might be useful to predict this kind of thi
... (read more)

It's a charitable (and hilarious) interpretation. What actually happened is that he drafted it by mistake instead of just editing it to add stuff. It should be fine now.

2 · Peter Wildeford · 10mo
Thanks!

You probably know better than me, but I still have this intuition that seed-AI and FOOM have oriented the framing of the problem and the sort of question asked. I think people who came to agent foundations from different routes ended up asking slightly different questions.

I could totally be wrong though, thanks for making this weakness of my description explicit!

That's a great point!

There's definitely one big difference between how Scott defined it and how I'm using it, which you highlighted well. I think a better way of explaining my change is that in Scott's original example, the AI being flawed results, in some sense, in the alignment scheme (predict human values and do that) being flawed too.

I hadn't made the explicit claim in my head or in the post, but thanks to your comment, I think I'm claiming that the version I'm proposing generalizes one of the interesting parts of the original definition, and lets it be appl... (read more)

2 · William_S · 1y
Yep, that clarifies.

Yeah, I see how it can be confusing. To give an example, Paul Christiano focuses on prosaic alignment (he even coined the term) yet his work is mostly on the conceptual side. So I don't see the two as in conflict.

Thanks for your comment!

Probably the best place to get feedback as a beginner is AI Safety Support. They can also redirect you towards relevant programs, and they have a nice alignment slack.

As for your idea, I can give you quick feedback on my issues with this whole class of solutions. I'm not saying you haven't thought about these issues, nor that no solution in this class is possible at all, just giving the things I would be wary of here:

  • How do you limit the compute if the AI is way smarter than you are?
  • Assuming that you can limit the compute, how much
... (read more)

Basically the distinction is relevant because there are definitely more and more people working on alignment, but the vast majority of the increase actually doesn't focus on formulating solution or deconfusing the main notions; instead they mostly work on (often relevant) experiments and empirical questions related to alignment. 

1 · p.b. · 1y
  This seemed to imply that you might be a conceptual alignment researcher, but also work on pure prosaic alignment, which was the point where I thought: Ok, maybe I don't know what "conceptual alignment research" means. But the link definitely clears it up, thank you!

Yeah, I will be posting updates, and probably the participants themselves will post some notes and related ideas. Excited too about how it's going to pan out!

Does Conjecture/Refine work with anyone remotely or is it all in person?

By default Conjecture is all in person, although right now for a bunch of administrative and travelling reasons we are more dispersed. For Refine it will be in person the whole time. Actually, ensuring that is one big reason we're starting in France (otherwise it would need to be partly remote for administrative reasons).

Having novel approaches to alignment research seems like it could really help the field at this still-early stage. Thanks for creating a program specifically designed to foster this.

You're welcome. ;)

Thanks for the comment!

To be honest, I had more trouble classifying you, and now that you commented, I think you're right that I got the wrong label. My reasoning was that your agenda and directions look far more explicit and precise than Paul's or Evan's, which is definitely a more mosaic-y trait. On the other hand, there is the iteration that you describe, and I can clearly see a difference in terms of updating between you and, let's say, John/Eliezer.

My current model is that you're more palimpsest-y, but compared with most of us, you're surprisingly good at making your current iteration fit into a proper structure that you can make explicit and legible.

(Will update the post in consequence. ;) )

Nice post! Two things I particularly like are the explicit iteration (demonstrating by example how and why not to only use one framing), as well as the online learning framing.

The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.

It seems like a core part of this initial framing relies on the operationalisation of ... (read more)

1 · LawrenceC · 10mo
I think here, competent can probably be defined in one of two (perhaps equivalent) ways:
  1. Restricted reward spaces/informative priors over reward functions: as the appropriate folk theorem goes, any policy is optimal according to some reward function. "Most" policies are incompetent; consequently, many reward functions incentivize behavior that seems incoherent/incompetent to us. It seems that when I refer to a particular agent's behavior as "competent", I'm often making reference to the fact that it achieves high reward according to a "reasonable" reward function that I can imagine. Otherwise, the behavior just looks incoherent. This is similar to the definition used in Langosco, Koch, Sharkey et al's goal misgeneralization paper [https://arxiv.org/abs/2105.14111], which depends on a non-trivial prior over reward functions.
  2. Demonstrates instrumental convergence/power-seeking behavior: in environments with regularities, certain behaviors are instrumentally convergent/power-seeking [https://arxiv.org/abs/1912.01683]. That is, they're likely to occur for a large class of reward functions. To evaluate if behavior is competent, we can look for behaviors that seem power-seeking to us (i.e., not dying in a game). Incompetent behavior is that which doesn't exhibit power-seeking or instrumentally convergent drives.
The reason these two can be equivalent is the aforementioned folk theorem: as every policy has a reward function that rationalizes it, there exist priors over reward functions where the implied prior over optimal policies doesn't demonstrate power-seeking behavior.
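The folk-theorem point can be made concrete with a minimal sketch (the state/action names and the toy policy are invented for illustration): for any fixed policy, the indicator reward that pays exactly for that policy's actions makes it optimal, however "incompetent" it looks.

```python
# Toy state/action spaces (invented for illustration).
states = ["s0", "s1"]
actions = ["left", "right"]

def incompetent_policy(state):
    # Always go left, regardless of state: incoherent-looking behavior.
    return "left"

def rationalizing_reward(state, action):
    # Pays 1 exactly for mimicking the policy's action, 0 otherwise.
    return 1.0 if action == incompetent_policy(state) else 0.0

# Under this reward the policy attains the maximum per-step reward in
# every state, so no other policy can do better: it is "optimal".
for s in states:
    best = max(rationalizing_reward(s, a) for a in actions)
    assert rationalizing_reward(s, incompetent_policy(s)) == best
```

This is why "optimal for some reward function" carries no information by itself; judging competence requires a non-trivial prior over which reward functions are reasonable.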

Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.

I think Steven's point is that if your explanation for modularity leading to broadness is that the parameters inside a module can take any configuration, conditioned on the output of the module staying the same, then you're at least missing an additional step showing that a network consisting of two modules with n/2 parameters each has more freedom in those parameters than a network consisting of one module (just the entire network itself) with n parameters does. Otherwise you're not actually pointing out how this favours modularity over non-modularity.

Whi... (read more)

Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.

I agree that avoiding the Hard Parts is rarely productive, but you also don't address one relevant concern: what if the Hard Part is not merely Hard, but actually Impossible? In this case your advice can also be cashed out by tryin... (read more)

9 · johnswentworth · 1y
Yes, although I consider that one more debatable. When there's not a "right" operationalization, that usually means that the concepts involved were fundamentally confused in the first place. Actually, I think starting from a behavioral theorem is fine. It's just not where we're looking to end up, and the fact that we want to open the black box should steer what starting points we look for, even when those starting points are behavioral.

In what way is AF not open to new ideas? I think it is a bit scary to publish a post here, but that has more to do with it being very public, and less to do with anything specific about the AF. But if AF has a culture of not welcoming new ideas, maybe we should fix that?

It's not that easy to justify a post from a year ago, but I think what I meant was that the Alignment Forum has a certain style of alignment research, and thus only reading it means you don't see stuff like CHAI research or other work that aims at alignment without being shared much on the AF.

Thanks!

I will look at the post soonish. Sorry for the delay in answering, I was on holiday this week. ^^

Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes?

This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

Other possible names would

... (read more)
1 · RobertKirk · 1y
This seems to suggest that you want a word for whatever the opposite of complex/chaotic systems is, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?
2 · Kenoubi · 1y
Thermodynamic? Thermodynamics seems to be about using a small number of summary statistics (temperature, pressure, density, etc.) because the microstructure of the system isn't necessary to compute what will happen at the macro level.

Thanks for this post, it's clear and insightful about RLHF.

From an alignment perspective, would you say that your work gives evidence that we should focus most of the energy on finding guarantees about the distribution that we're aiming for and debugging problems there, rather than thinking about the guarantees of the inference?

(I still expect that we want to understand the inference better and how it can break, but your post seems to push towards a lesser focus on that part)

5 · Tomek Korbak · 1y
I'm glad you found our post insightful! I'm not sure what the best energy allocation between modelling and inference is here. I think, however, that the modelling part is more neglected (the target distribution is rarely even considered as something that can be written down and analysed). Moreover, designing good target distributions can be quite alignment-specific, whereas designing algorithms for inference in probabilistic graphical models is an extremely generic research problem, so we can expect progress here anyway.