All of Remmelt's Comments + Replies

Sure, I appreciate the open question!

That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.

Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.

This is because what we are dealing with is machinery that continue... (read more)

Thanks for your kind remarks.

But if technical uncontrollability would be firmly established, it seems to me that this would significantly change the whole AI xrisk space

Yes, we would need to shift focus to acting to restrict corporate-AI scaling altogether. Particularly, restrict data piracy, environmentally harmful compute, and model misuse (three dimensions through which AI corporations consolidate market power).

I am working with other communities (including digital creatives, environmentalists and military veterans) on litigation and lobbying acti... (read more)

if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.

This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.

https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p9

2mesaoptimizer5d
Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.

I think the distinction you are trying to make is roughly that between ‘implicit/aligned control’ and ‘delegated control’ as terms used in this paper: https://dl.acm.org/doi/pdf/10.1145/3603371

Both still require control feedback processes built into the AGI system/infrastructure.

Can you think of any example of an alignment method being implemented soundly in practice without use of a control feedback loop?
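To illustrate what I mean by a control feedback loop here, a minimal sketch (my own illustration; the function names and the toy setpoint example are hypothetical placeholders, not any existing alignment method):

```python
# Minimal sketch of a control feedback loop in the sense used above:
# (1) receive input signals, (2) compare them against reference values,
# (3) apply corrections. All names are illustrative placeholders.

def control_feedback_loop(sense, compare, correct, reference, steps=100):
    """Generic detect-compare-correct loop."""
    for _ in range(steps):
        observed = sense()                        # receive input signals
        deviation = compare(observed, reference)  # detect drift from the reference
        if deviation != 0:
            correct(deviation)                    # counteract the detected deviation

# Toy usage: keep a drifting value near a setpoint of 1.0.
state = {"value": 0.0}
control_feedback_loop(
    sense=lambda: state["value"] + 0.3,                              # stand-in for environmental drift
    compare=lambda obs, ref: obs - ref,                              # signed deviation
    correct=lambda d: state.update(value=state["value"] - 0.5 * d),  # damped correction
    reference=1.0,
)
```

The point of the question above is that any alignment method implemented in practice seems to need some version of this sense-compare-correct structure, whether built into the AGI itself or into its surrounding infrastructure.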

2mesaoptimizer5d
Assuming an inner aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI [https://www.lesswrong.com/posts/4RrLiboiGGKfsanMF/the-qaci-alignment-plan-table-of-contents], since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] provide an this AI system this goal and allow it to do whatever it considers necessary to maximize that goal. I do not believe this alignment strategy requires a control feedback loop at all. And I do believe that retaining control over an AI as it rapidly improves capabilities is perhaps a quixotic goal. So no, I am not pointing at the distinction between 'implicit/aligned control' and 'delegated control' as terms used in the paper. From the paper: Well, in the example given above, the agent doesn't decide for itself what the subject's desire is: it simply optimizes for its own desire. The work of deciding what is 'long-term-best for the subject' does not happen unless that is actually what the goal specifies. -------------------------------------------------------------------------------- 1. For certain definitions of "simply". ↩︎
1Remmelt5d
I think the distinction you are trying to make is roughly that between ‘implicit/aligned control’ and ‘delegated control’ as terms used in this paper: https://dl.acm.org/doi/pdf/10.1145/3603371 [https://dl.acm.org/doi/pdf/10.1145/3603371] Both still require control feedback processes built into the AGI system/infrastructure.

Agreed (and upvoted).

It’s not strong evidence of impossibility by itself.

and thus is motivated to find reasons for alignment not being possible.

I don’t get this sense.

More like Yudkowsky sees the rate at which AI labs are scaling up and deploying code and infrastructure of ML models, and recognises that there are a bunch of known core problems that would need to be solved before there is any plausible possibility of safely containing/aligning AGI optimisation pressure toward outcomes.

I personally think some of the argumentation around AGI being able to internally simulate the complexity in the outside world and play it like a co... (read more)

The premise that “infinite value” is possible is an assumption.

This seems a bit like the presumption that “divide by zero” is possible. Assigning a probability to the possibility that divide by zero results in a value doesn’t make sense, I think, because the logical rules themselves rule this out.
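As a minimal formal note on the analogy (my own framing, assuming standard field-style arithmetic, not something from the original exchange): “1/0 has a value” is not an open possibility one can weigh probabilistically, because the definitions themselves exclude it.

```latex
% Division is defined via multiplicative inverses, so 1/0 = x would require 0*x = 1.
% But the distributive law forces 0*x = 0 for every x, so no such x can exist.
\[
\frac{1}{0} = x \;\Longleftrightarrow\; 0 \cdot x = 1,
\qquad \text{while} \qquad
0 \cdot x = (0 + 0)\cdot x = 0\cdot x + 0\cdot x \;\Rightarrow\; 0\cdot x = 0 .
\]
```

The analogous claim is that “infinite value” may likewise be ruled out by the framework’s own rules, rather than being an empirical possibility to which one assigns a credence.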

However, if I look at this together with your earlier post (http://web.archive.org/web/20230317162246/https://www.lesswrong.com/posts/dPCpHZmGzc9abvAdi/orthogonality-thesis-is-wrong): I think I get where you’re coming from in that if the agent can conceptualise ... (read more)

-3Donatas Lučiūnas3mo
Why do you think "infinite value" is logically impossible? Scientists do not dismiss possibility that the universe is infinite. https://bigthink.com/starts-with-a-bang/universe-infinite/ [https://bigthink.com/starts-with-a-bang/universe-infinite/]

Great overview! I find this helpful.

Next to intrinsic optimisation daemons that arise through training internal to hardware, I suggest adding extrinsic optimising "divergent ecosystems" that arise through deployment and gradual co-option of (phenotypic) functionality within the larger outside world.

AI Safety research so far has focussed more on internal code (particularly by CS/ML researchers) computed deterministically (within known statespaces, as mathematicians like to represent them). That is, rather than complex external feedback loops that are uncomputable – g... (read more)

1Ryan Kidd4mo
Cheers, Remmelt! I'm glad it was useful. I think the extrinsic optimization you describe is what I'm pointing toward with the label "coordination failures," which might properly be labeled "alignment failures arising uniquely through the interactions of multiple actors who, if deployed alone, would be considered aligned."

Unfortunately, perhaps due to the prior actions of others in your same social group, a deceptive frame of interpretation is more likely to be encountered first, effectively 'inoculating' everyone else in the group against an unbiased receipt of any further information.



Written in 2015.  Still relevant.

Suppose the Illusion of Truth effect and the Ambiguity Effect are each biasing how researchers in AI Safety evaluate one of the two options below.

If you had to choose, which bias would more likely apply to which option?

  • A:  Aligning AGI to be safe over the long term is possible in principle.
  • B:  Long-term safe AGI is impossible fundamentally.

it needs to plug into the mathematical formalizations one would use to do the social science form of this.

Could you clarify what you mean by a "social science form" of a mathematical formalisation? 
I'm not familiar with this.
 

they're right to look at people funny even if they have the systems programming experience or what have you.

It was expected and understandable that people looked funny at the writings of a multi-skilled researcher with new ideas they were not yet familiar with. 
Let's move on from first impressions.

 

s

... (read more)
2the gears to ascension5mo
but if we can take the type signature from a simulation, then we can attempt to do formal reasoning about its possibility space given the concrete example. if we don't have precise types, we can't reason through these systems. b seems to me to be a falsifiable claim that cannot be determined true or false from pure rational computation, it requires active investigation. we have evidence of it, but that evidence needs to be cited. How does your approach compare with https://www.metaethical.ai/ [https://www.metaethical.ai/]?

Really appreciate you sharing your honest thoughts here, Rekrul.

From my side, I’d value actually discussing the reasoning forms and steps we already started to outline on the forum. For example, the relevance of intrinsic vs extrinsic selection and correction, or the relevance of the organic vs. artificial substrate distinction. These distinctions are something I would love to openly chat about with you (not the formal reasoning – I’m the bridge-builder, Forrest is the theorist).

That might feel unsatisfactory – in the sense of “why don’t you just give us th... (read more)

BTW, I prefer you being blunt, so glad you’re doing that.

A little more effort to try to understand where we could be coming from would be appreciated. Particularly given what’s at stake here – a full extinction event.

Neither Forrest nor I have any motivation to post unsubstantiated claims. Forrest because frankly, he does not care one bit about being recognised by this community – he just wants to find individuals who actually care enough to consider the arguments rigorously. Me because all I’d be doing is putting my career at risk.

You can't complain about people engaging with things other than your idea if the only thing they can even engage with is your idea.

The tricky thing here is that a few people are reacting by misinterpreting the basic form of the formal reasoning at the outset, and judging the merit of the work by their subjective social heuristics.

Which does not lend me (nor Forrest) confidence that those people would do a careful job at checking the term definitions and reasoning steps – particularly if written in precise analytic language that is unlike the mathematical... (read more)

7[anonymous]5mo
2Remmelt5mo
BTW, I prefer you being blunt, so glad you’re doing that. A little more effort to try to understand where we could be coming from would be appreciated. Particularly given what’s at stake here – a full extinction event. Neither Forrest nor I have any motivation to post unsubstantiated claims. Forrest because frankly, he does not care one bit about being recognised by this community – he just wants to find individuals who actually care enough to consider the arguments rigorously. Me because all I’d be doing is putting my career at risk.

The problem of a very poor signal to noise ratio from messages received from people outside of the established professional group basically means that the risk of discarding a good proposal from anyone regarded as an outsider is especially high.


This insight feels relevant to a comment exchange I was in yesterday. An AI Safety insider (Christiano) lightly read an overview of work by an outsider (Landry). The insider then judged the work to be "crankery", in effect acting as a protective barrier against other insiders having to consider the new ideas... (read more)

[anonymous]5mo1515

Your remarks make complete sense. 

Forrest mentioned that for most people, his precise "EGS" format will be unparsable unless one has had practice with it. Also agreed that there is no background or context. The "ABSTract" is often really too brief a note, usually just a reminder of what the overall idea is. And the text itself IS internal notes, as you have said. 

He says that it is a good reminder that he should remember to convert "EGS" to normal prose before publishing. He does not always have the energy or time or enthusiasm to do it. ... (read more)

1M. Y. Zuo5mo
Strongly upvoted this simply because it was at -20 overall karma prior. Considering the vast majority of comments on LW have less work and reasoning put into them than this, I think such a low score was unwarranted. How a claim such as "perpetual motion machines can be known to be 100% impossible with 100% certainty" should be evaluated I'm also curious about.
3TAG5mo
There isn't a single set of axioms that's accepted by everybody.
5paulfchristiano5mo
I think saying "it is possible to be 100% certain that a perpetual motion machine is 100% impossible" would be a forgivable exaggeration. I expected to find a proof because the article makes reference to "our proof" and also claims a level of confidence that seems to imply they have a proof. (Or at least a precise argument.) If you are implying that there is an explicit proof and it's just not in the essay, I think it would be helpful to provide a link to it. This is what I would have expected to find in the essay. (Note that this is not the only or even main reason I dismissed the article, I was just listing the nonstandard usage of the words "proof" and "100% confidence" as one thing that would turn off most researchers. I also think it's bad for clear communication but it's not the end of the world.)

Good to know, thank you. I think I’ll just ditch the “separate claims/arguments into lines” effort.

Forrest also just wrote me: “In regards to the line formatting, I am thinking we can, and maybe should (?) convert to simple conventional wrapping mode? I am wondering if the phrase breaks are more trouble than they are worth, when presenting in more conventional contexts like LW, AF, etc. It feels too weird to me, given the already high weirdness level I cannot help but carry.”

2Viliam5mo
I think my problem is with something other than line breaks (although the line breaks do increase the weird feeling). The text is, essentially, bullet points. There is no introduction, no summary. (Actually, there is an attempt to "ABST", but it's not really legible.) If I don't guess correctly what the author is trying to say, there is very little effort to communicate that to me. This seems like someone's private notes that were not intended for an audience. To compare, this [http://diyhpl.us/~bryan/papers2/physics/Putting%20out%20the%20dark%20fire:%20constraining%20speculative%20physics%20disasters%20-%20Sandberg%20-%202015.pdf] is a text (from the same author) that I can read and understand easily. Because it has sentences, paragraphs, explanations.

Example of a statement with a mere exposure effect: “aligning AGI is possible in principle”

A paper that describes a risk-assessment monoculture in evaluating extinction risks:  Democratising Risk.

Many ordinary people in Western countries do and will have [investments in AI/robots] (if only for retirement purposes), and will therefore receive a fraction of the net output from the robots. 

... Of course, many people today don't have such investments. But under our existing arrangements, whoever does own the robots will receive the profits and be taxed. Those taxes can either fund consumption directly (a citizen's dividend, dole, or suchlike) or (better I think) be used to buy capital investments in the robots - such purchases could be distributed

... (read more)

Appreciating your honesty, genuinely!

Always happy to chat further about the substantive arguments. I was initially skeptical of Forrest’s “AGI-alignment is impossible” claim. But after probing and digging into this question intensely over the last year, I could not find anything unsound (in terms of premises) or invalid (in terms of logic) about his core arguments.

Responding below:

  1. That prior for most problems being solvable is not justified. For starters, you did not provide any reasons above to justify why beneficial AGI is not like a perpetual motion machine, AKA a “perpetual general benefit machine”.

See reasons to shift your prior: https://www.lesswrong.com/posts/Qp6oetspnGpSpRRs4/list-3-why-not-to-assume-on-prior-that-agi-alignment

  2. Again, no reasons given for the belief that AGI alignment is “progressing” or would have a “fair shot” of solving “the problem” if as well resourced as capabilities resea

... (read more)
4Noosphere895mo
I'll concede here that I unfortunately do not have good arguments, and I'm updating towards pessimism regarding the alignment problem.

Let me also copy over Forrest’s (my collaborator) notes here:

  > people who believe false premises tend to take bad actions.

  Argument 3:.

 - 1; That AGI can very easily be hyped so that even smart people
 can be made to falsely/incorrectly believe that there "might be"
 _any_chance_at_all_ that AGI will "bring vastly positive changes".
   - ie, strongly motivated marketing will always be stronger than truth,
   especially when VC investors can be made to think (falsely)
   that they could maybe get 10000X return on investment.
   - that the nature 
... (read more)
1Noosphere895mo
I am honestly very confused on how Forrest is so confident that radical positive changes will not happen in our lifetime. More importantly, he seems to be complaining that his opponents have different goals, and claims they're selectively rational. Heads up, but rational behavior can only be determined once what goals you have are determined. Now, to him, his goals probably are much less selfish than those that want AI progress to speed up, so it's not rational for AI capabilities to increase. I too do not think AI progress is beneficial, and believe it probably is harmful, so I'd slowdown on the progress too. This is critical, because Forrest is misidentifying why AI progress people want AI to progress. The fact that they have very different goals compared to you is the reason why they want AI to progress, and not a rationality failure. Another critical crux is I am far more optimistic than Forrest or Remmelt on AGI Alignment working out in the end. If I had a pessimism level comparable to Forrest or Remmelt, I too would probably advocate far more around governance strategies. This is for several reasons: 1. My general prior is most problems are solvable. This doesn't always occur, see the halting problem's unsolvability, or the likely non-solvability of a perpetual motion machine, but my prior is if there isn't a theorem prohibiting it and it doesn't rely on violating the laws of physics, I'd say it solvable. And AGI alignment is in this spot. 2. I believe alignment is progressing, not enough to be clear, but if AI alignment was as well resourced as AI capabilities research, then I'd give it a fair shot of solving the problem. 3. Finally, time. In the more conservative story described here, it still takes 20-30 years, and while AGI now would probably be incompatible with life due to instrumental convergence and inner alignment failures, so long as you have extremely pessimistic beliefs about progress in AI alignment, t

That’s clarifying. I agree that immediately trying to impose costly/controversial laws would be bad.

What I am personally thinking about first here is “actually trying to clarify the concerns and find consensus with other movements concerned about AI developments” (which by itself does not involve immediate radical law reforms).

We first need to have a basis of common understanding from which legislation can be drawn.

I think there are a bunch of relevant but subtle differences in terms of how we are thinking about this. My beliefs after quite a lot of thinking are:

A. Most people don’t care about the tech singularity. People are captured by the AI hype cycles though, especially people who work under the tech elite. The general public is much more wary overall of the current use of AI though, and is starting to notice the harms in their daily lives (eg. addictive and ideology + distorted self-image reinforcing social media, exploitative work gigs handed to them by algorithms).

B.... (read more)

2Noosphere896mo
One key point to keep in mind is that my arguments aren't about refuting the idea of slowing down AI, instead it's about offering a reality check. The reason I said baby steps is that 1. They might be enough, but 2. even if it isn't enough, one common failure mode in politics is to go fully maximalist in your agenda first. This is a route to failure for your agenda. It is better instead to progress your agenda from the least controversial/costly, and if necessary go then add more costly/controversial laws. However this is extremely dangerous, a single case of bad publicity or otherwise making it very controversial to govern AI may well doom the effort. Another lesson for politics is that your opposition (AI companies) is probably rational, but having very different goals compared to the median LW/EA person. So we shouldn't expect unusually easy wins in this area, and progress will likely be slow, especially in lobbying. It's still very useful for AI governance to do it, the high risk does not mean there aren't high rewards, especially if you think AI Alignment is possible, but governance can help AI Alignment do it's best, as well as preventing s-risks, but I do think that AI governance may be overestimating what costs the public and companies are willing to bear for regulations. Especially if AI companies can make externalities. For example, the climate change agenda stalled until solar, wind and batteries became cheap enough in the 2010s that moving out of fossil fuels represented a very cheap way to decarbonize. And still there's some opposition here.

Good to read your thoughts.

I would agree that slowing further AI capability generalisation developments down by more than half in the coming years is highly improbable. Got to work with what we have.

My mental model of the situation is different.

  1. People engage in positively reinforcing dynamics around social prestige and market profit, even if what they are doing is net bad for what they care about over the long run.

  2. People are mostly egocentric, and have difficulty connecting and relating, particularly in the current individualistic social signalling and

... (read more)
1Noosphere896mo
So we agree that people are selfish/egocentric, essentially. My problem is that from a selfish perspective, even a low chance of the technological singularity (let's say you can survive to see from your perspective essentially a near-heaven) outweighs the high chance harm to the self and others, by multiple orders of magnitude. Arguably more than 10 orders of magnitude. Even most non-narcissists/non-psychopaths would take this deal, and unless convenient plot induced stupidity occurs, we should expect this again and again. So I disagree with numbers 1 and 3, since given their selfishness, they can distribute externalities to others.

This is insightful for me, thank you!

Also, I stand corrected then on my earlier comment that privacy and digital ownership advocates would/should care about models being trained on their own/person-tracking data, so as to restrict the scaling of models. I’m guessing I was not tracking well what people in at least the civil rights spaces Koen moves around in are thinking and would advocate for.

re: Leaders of movements being skeptical of the notion of AGI.

Reflecting more, my impression is that Timnit Gebru is skeptical about the sci-fi descriptions of AGI, and even more so about the social motives of people working on developing (safe) AGI. She does not say that AGI is an impossible concept or not actually a risk. She seems to question the overlapping groups of white male geeks who have been diverting efforts away from other societal issues, toward both promoting AGI development and warning of AGI x-risks. 

Regarding Jaron Lanier, yes, (re)readi... (read more)

Returning to the error correction point:

Feel free to still clarify the other reasons why the changes in learning would be stable in preserving “good properties”. Then I will take that starting point to try to explain why the mutually reinforcing dynamics of instrumental convergence and substrate-needs convergence override that stability.

Fundamentally though, we'll still be discussing the application limits of error correction methods. 

Three ways to explain why:

  • Any workable AI-alignment method involves receiving input signals, comparing input signals against
... (read more)

Forrest Landry. 

Here is how he described himself before:

   > What is your background?
  > How is it relevant to the work
  > you are planning to do?

  Years ago, we started with a strong focus on
  civilization design and mitigating x-risk.
  These are topics that need and require
  more generalist capabilities, in many fields,
  not just single specialist capabilities,
  in any one single field of study or application.

  Hence, as generalists,
  we are not specifically persons
  who are c

... (read more)

Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures.


Exactly this kind of thinking is what I am concerned about. It implicitly assumes that you have a (sufficiently) comprehensive and sound understanding of the ways humans would get killed at a given level of capability, and therefore can rely on that understanding to conclude that capabilities of AIs can be greatly increased without humans getting killed.

How do you think capability developers would... (read more)

If mechanistic interpretability methods cannot prevent the interactions of AGI from necessarily converging on total human extinction (beyond theoretical limits of controllability), it means that these (or other "inspect internals") methods cannot contribute to long-term AGI safety. And this is not idle speculation, nor based on prima facie arguments. It is based on 15 years of research by a polymath working outside this community.

In that sense, it would not really matter that mechanistic interpretability can do an okay job at detecting that a power-seeking A... (read more)

3ThomasW6mo
Sorry if I missed it earlier in the thread, but who is this "polymath"?
4Remmelt6mo
Exactly this kind of thinking is what I am concerned about. It implicitly assumes that you have a (sufficiently) comprehensive and sound understanding of the ways humans would get killed at a given level of capability, and therefore can rely on that understanding to conclude that capabilities of AIs can be greatly increased without humans getting killed. How do you think capability developers would respond to that statement? Will they just stay on the safe side, saying "Well those alignment researchers say that mechanistic interpretability helps remove intentional deception or harm, but I'm just going to stay on the safe side and not scale any further". No, they are going to use your statement to promote the potential safety of their scalable models, and remove whatever safety margin they can justify themselves taking and feel justified taking for themselves. Not considering unknown unknowns is going to get us killed. Not considering what safety problems may be unsolvable is going to get us killed.  Age-old saying: "It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so."

No, it's not like that. 

It's saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it's still lethal. Focussing on some ways that you feel confident that you might be able to prevent the doomsday device from being lethal is IMO dangerously distracting from the point, which is that people should not build the doomsday device in the first place.

4Remmelt6mo
If mechanistic interpretability methods cannot prevent that interactions of AGI necessarily converge on total human extinction beyond theoretical limits of controllability, it means that these (or other "inspect internals") methods cannot contribute to long-term AGI safety.  And this is not idle speculation, nor based on prima facie arguments. It is based on 15 years of research by a polymath working outside this community. In that sense, it would not really matter that mechanistic interpretability can do an okay job at detecting that a power-seeking AI was explicitly plotting to overthrow humanity. That is, except for the extremely unlikely case you pointed to that such intentions are detected and on time, and humans all coordinate at once to impose an effective moratorium on scaling or computing larger models. But this is actually speculation, whereas that OpenAI promoted Olah's fascinating Microscope [https://openai.com/blog/microscope]-generated images as them making progress on understanding and aligning scalable ML models is not speculation.  Overall, my sense is that mechanistic interpretability is used to align-wash capability progress towards AGI, while not contributing to safety where it predominantly matters.

No, it's not like that. 

It's saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it's still lethal. Focussing on some ways that you feel confident that you might be able to prevent the doomsday device from being lethal is IMO dangerously distracting from the point, which is that people should not build the doomsday device in the first place.

I intend to respond to the rest tomorrow.

Some of your interpretations of writings by Timnit Gebru and Glen Weyl seem fair to me (though I would need to ask them to confirm). I have not looked much into Jaron Lanier’s writings on AGI so that prompts me to google that.

Perhaps you can clarify the other reasons why the changes in learning would be stable in preserving “good properties”? I’ll respond to your nuances regarding how to interpret your long-term-evaluating error correcting code after that.

It's prima facie absurd to think that, e.g., using interpretability tools to discover that AI models were plotting to overthrow humanity would not help to avert that risk.

I addressed claims of similar forms at least three times already on separate occasions (including in the post itself).

Suggest reading this: https://www.lesswrong.com/posts/bkjoHFKjRJhYMebXr/the-limited-upside-of-interpretability?commentId=wbWQaWJfXe7RzSCCE

“The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignment of ... (read more)

This is like saying there's no value to learning about and stopping a nuclear attack from killing you because you might get absolutely no benefit from not being killed then, and being tipped off about a threat trying to kill you, because later the opponent might kill you with nanotechnology before you can prevent it.

Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures. And as I said actually being able to show a threat to skeptics is immensely better for all solutions, including relinquishment, than controversial speculation.

A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model. 

I can imagine there being movements that fit this description, in which case I would not focus on talking with them or talking about them. 

But I have not been in touch with any movements matching this description. Perhaps you could share specific examples ... (read more)

As requested by Remmelt I'll make some comments on the track record of privacy advocates, and their relevance to alignment.

I did some active privacy advocacy in the context of the early Internet in the 1990s, and have been following the field ever since. Overall, my assessment is that the privacy advocacy/digital civil rights community has had both failures and successes. It has not succeeded (yet) in its aim to stop large companies and governments from having all your data. On the other hand, it has been more successful in its policy advocacy towards ... (read more)

I agree that some specific leaders you cite have expressed distaste for model scaling, but it seems not to be a core concern. In a choice between more politically feasible measures that target concerns they believe are real vs concerns they believe are imaginary and bad, I don't think you get the latter. And I think arguments based on those concerns get traction on measures addressing the concerns, but less so on secondary wishlist items of leaders.

I think that's the reason privacy advocacy in legislation and the like hasn't focused on banning computers i... (read more)

There are plenty of movements out there (ethics & inclusion, digital democracy, privacy, etc.) who are against current directions of AI developments, and they don’t need the AGI risk argument to be convinced that current corporate scale-up of AI models is harmful.

Working with them, redirecting AI developments away from more power-consolidating/general AI may not be that much harder than investing in supposedly “risk-mitigating” safety research.

Do you think there is a large risk of AI systems killing or subjugating humanity autonomously related to scale-up of AI models?

A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model. It also seems like it wouldn't pursue measures targeted at the kind of disaster it denies, and might actively discourage them (this sometimes happ... (read more)

In any case, this does not rule out that there might be computationally cheap-to-extract facts about the AI that let us make important coarse-grained predictions (such as "Is it going to kill us all?"... trying to scam people we're simulating for money). I think this is an unrealistically optimistic picture, but I don't see how it's ruled out specifically by the arguments in this post.

This conclusion has the appearance of being reasonable, while skipping over crucial reasoning steps. I'm going to be honest here.

The fact that mechanistic interpretabili... (read more)

This post argues that mechanistic interpretability's scope of application is too limited. Your comment describes two misalignment examples that are (maybe) within mechanistic interpretability's scope of application.

Therefore, this post (and Limited Upside of Interpretability) applies to your comment – by showing the limits of where the comment's premises apply – and not the other way around.
 

To be more specific
You gave two examples for the commonly brought up cases of intentional direct lethality and explicitly rendered deception: "is it (intending to... (read more)

I read your comment before. My post applies to your comment (coarse-grained predictions based on internal inspection are insufficient).

EDIT: Just responded: https://www.lesswrong.com/posts/bkjoHFKjRJhYMebXr/the-limited-upside-of-interpretability?commentId=wbWQaWJfXe7RzSCCE Thanks for bringing it to my attention again.

3Remmelt6mo
This post argues that mechanistic interpretability's scope of application is too limited. Your comment describes two misalignment examples that are (maybe) within mechanistic interpretability's scope of application. Therefore, this post (and Limited Upside of Interpretability [https://www.lesswrong.com/posts/bkjoHFKjRJhYMebXr/the-limited-upside-of-interpretability]) applies to your comment – by showing the limits of where the comment's premises apply – and not the other way around.   To be more specific You gave two examples for the commonly brought up cases of intentional direct lethality and explicitly rendered deception: "is it (intending to) going to kill us all" and "checking whether the AI is still in a training sandbox...and e.g. trying to scam people we're simulating for money". The two examples given are unimaginative in terms of how human-lethal misalignment can (and would necessarily) look like over the long run. They are about the most straightforward AGI misalignment scenarios we could wish to detect for. Here are some desiderata those misalignment threats would need to meet to be sufficiently detectable (such to correct them to not cause (a lot of) harm over the long run): 1. Be fully explicitly represented (at least as a detectable abstraction that is granular enough for the misalignment to be correctable) by code within the AGI. Ie. the misaligned outcomes are not manifested implicitly as a result of the specific iterated interactions between the changing AGI internals and changing connected surroundings of the (more dynamically complex) environment. 2. Be compressible in their fine-grained characteristics (code parameters, connectivity, potential/actual inputs over time, all possible output channels of influence on the environment) such for relevant aspects not to be filtered out by the necessarily lossy data-compression processes of the mechanistic interpretability system, such to be comparable and "aligna

I got some great constructive feedback from Linda Linsefors (which she gave me permission to share). 

Regarding the summary, Linda thinks it is not a good one. In short, she thinks it highlights some of the weakest parts of the paper, and undersells the most important parts of the paper (eg. survey of impossibility arguments from other academic fields).

Also, that there is too much coverage of generic arguments about AI Safety in the summary. Those arguments make sense in the original post, given the expected audience. But those comments do not make sens... (read more)

I very much agree with your arguments here for re-focussing public explanations around not developing ‘uncontrollable AI’.

Two other reasons why to switch framing:

  1. For control/robotics engineers and software programmers, I can imagine ‘AGI’ is often a far-fetched idea that has no grounding in concrete gears-level principles of engineering and programming. But ‘uncontrollable’ (or unexplainable, or unpredictable) AI is something I imagine many non-ML engineers and programmers in industry would feel intuitively firmly against. Like, you do not want your softwar

... (read more)
2Karl von Wendt8mo
Thank you for your comments, which I totally agree with.

I'm just saying that if any AI with external access would be considered dangerous

 

I'm saying that general-purpose ML architectures would develop especially dangerous capabilities by being trained in high-fidelity and high-bandwidth input-output interactions with the real outside world. 

A specific cruxy statement that I disagree on:

An AI that is connected to the internet and has access to many gadgets and points of contact can better manipulate the world and thus do dangerous things more easily. However, if an AI would be considered dangerous if it had access to some or all of these things, it should also be considered dangerous without it, because giving such a system access to the outside world, either accidentally or on purpose, could cause a catastrophe without further changing the system itself. Dynamite is considered dangerous even

... (read more)
3Karl von Wendt10mo
Not at all. I'm just saying that if any AI with external access would be considered dangerous, then the same AI without access should be considered dangerous as well. The dynamite analogy was of course not meant to be a model for AI, I just wanted to point out that even an inert mass that in principle any child could play with without coming to harm is still considered dangerous, because under certain circumstances it will be harmful. Dynamite + fire = damage, dynamite w/o fire = still dangerous. Your third argument seems to prove my point: An AI that seems aligned in the training environment turns out to be misaligned if applied outside of the training distribution. If that can happen, the AI should be considered dangerous, even if within the training distribution it shows no signs of it.

Relatedly, a sense I got from reading posts and comments on LessWrong:

A common reason why a researcher ends up Strawmanning another researcher's arguments, or publicly claims that another researcher is Strawmanning someone else, is that they never spend the time and attention (because they're busy/overwhelmed?) to carefully read through the sentences written by the other researcher and consider how those sentences could be ambiguous in meaning and open to multiple interpretations.

When your interpretation of what (implicit) claim the author is ar... (read more)
