Thanks for your kind remarks.
But if technical uncontrollability were firmly established, it seems to me that this would significantly change the whole AI x-risk space
Yes, we would need to shift focus to acting to restrict corporate-AI scaling altogether. Particularly, to restrict data piracy, environmentally toxic compute, and model misuses (three dimensions through which AI corporations consolidate market power).
I am working with other communities (including digital creatives, environmentalists and military veterans) on litigation and lobbying acti...
if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system with that goal and allow it to do whatever it considers necessary to maximize that goal.
This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.
I think the distinction you are trying to make is roughly that between ‘implicit/aligned control’ and ‘delegated control’ as terms used in this paper: https://dl.acm.org/doi/pdf/10.1145/3603371
Both still require control feedback processes built into the AGI system/infrastructure.
Can you think of any example of an alignment method being implemented soundly in practice without use of a control feedback loop?
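For concreteness, here is a minimal sketch (my own illustration with hypothetical function names, not taken from the paper) of the sense-compare-correct structure I mean by a control feedback loop:

```python
# Minimal sketch of a control feedback loop: sense the system's actual
# behaviour, compare it against a reference, and feed a correction back in.
# All names here are hypothetical, for illustration only.

def control_feedback_loop(system, reference, sense, compare, correct, steps=100):
    for _ in range(steps):
        observed = sense(system)              # detection: measure actual behaviour/effects
        error = compare(reference, observed)  # evaluation: deviation from the intended reference
        correct(system, error)                # correction: feed the error signal back into the system
```

Both implicit/aligned control and delegated control, as I read the paper, still presuppose some instantiation of this loop somewhere in the AGI system/infrastructure.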
and thus is motivated to find reasons for alignment not being possible.
I don’t get this sense.
More like Yudkowsky sees the rate at which AI labs are scaling up and deploying code and infrastructure of ML models, and recognises that there are a bunch of known core problems that would need to be solved before there is any plausible possibility of safely containing/aligning AGI optimisation pressure toward outcomes.
I personally think some of the argumentation around AGI being able to internally simulate the complexity in the outside world and play it like a co...
The premise that “infinite value” is possible is an assumption.
This seems a bit like the presumption that “divide by zero” is possible. Assigning a probability to the possibility that divide by zero results in a value doesn’t make sense, I think, because the logical rules themselves rule this out.
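To make the analogy explicit, here is my own minimal sketch, using only standard field axioms:

```latex
% Minimal sketch of the 'divide by zero' analogy, using standard field axioms.
% If $1/0$ denoted some value $v$, then by the definition of division $0 \cdot v = 1$;
% but $0 \cdot v = 0$ for every $v$, so no such $v$ exists.
% (\nexists requires the amssymb package.)
\[
\nexists\, v \in \mathbb{R} :\; 0 \cdot v = 1
\quad\Longrightarrow\quad
\text{``$1/0 = v$'' picks out no value, so } P(1/0 = v) \text{ is not well-defined.}
\]
```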
However, if I look at this together with your earlier post (http://web.archive.org/web/20230317162246/https://www.lesswrong.com/posts/dPCpHZmGzc9abvAdi/orthogonality-thesis-is-wrong): I think I get where you’re coming from in that if the agent can conceptualise ...
Great overview! I find this helpful.
Next to intrinsic optimisation daemons that arise through training internal to hardware, I suggest adding extrinsic optimising "divergent ecosystems" that arise through deployment and gradual co-option of (phenotypic) functionality within the larger outside world.
AI Safety research so far has focussed more on internal code (particularly by CS/ML researchers) computed deterministically (within known state spaces, as mathematicians like to represent them). That is, rather than on complex external feedback loops that are uncomputable – g...
Unfortunately, perhaps due to the prior actions of others in your same social group, a deceptive frame of interpretation is more likely to be encountered first, effectively 'inoculating' everyone else in the group against an unbiased receipt of any further information.
Written in 2015. Still relevant.
Say maybe the Illusion of Truth effect and the Ambiguity Effect are each biasing how researchers in AI Safety evaluate one of the options below.
If you had to choose, which bias would more likely apply to which option?
it needs to plug into the mathematical formalizations one would use to do the social science form of this.
Could you clarify what you mean by a "social science form" of a mathematical formalisation?
I'm not familiar with this.
they're right to look at people funny even if they have the systems programming experience or what have you.
It was expected and understandable that people look funny at the writings from a multi-skilled researcher with new ideas that those people were not yet familiar with.
Let's move on from first impressions.
Really appreciate you sharing your honest thoughts here, Rekrul.
From my side, I’d value actually discussing the reasoning forms and steps we already started to outline on the forum. For example, the relevance of intrinsic vs extrinsic selection and correction, or the relevance of the organic vs. artificial substrate distinction. These distinctions are something I would love to openly chat about with you (not the formal reasoning – I’m the bridge-builder, Forrest is the theorist).
That might feel unsatisfactory – in the sense of “why don’t you just give us th...
BTW, I prefer you being blunt, so glad you’re doing that.
A little more effort to try to understand where we could be coming from would be appreciated. Particularly given what’s at stake here – a full extinction event.
Neither Forrest nor I have any motivation to post unsubstantiated claims. Forrest because frankly, he does not care one bit about being recognised by this community – he just wants to find individuals who actually care enough to consider the arguments rigorously. Me because all I’d be doing is putting my career at risk.
You can't complain about people engaging with things other than your idea if the only thing they can even engage with is your idea.
The tricky thing here is that a few people are reacting by misinterpreting the basic form of the formal reasoning at the onset, and judging the merit of the work by their subjective social heuristics.
Which does not lend me (nor Forrest) confidence that those people would do a careful job at checking the term definitions and reasoning steps – particularly if written in precise analytic language that is unlike the mathematical...
The very poor signal-to-noise ratio of messages received from people outside of the established professional group basically means that a good proposal from anyone regarded as an outsider is especially likely to be discarded.
This insight feels relevant to a comment exchange I was in yesterday. An AI Safety insider (Christiano) lightly read an overview of work by an outsider (Landry). The insider then judged the work to be "crankery", in effect acting as a protective barrier against other insiders having to consider the new ideas...
Your remarks make complete sense.
Forrest mentioned that for most people, his precise "EGS" format will be unparsable unless they have had practice with it. Also agreed that there is no background or context. The "ABSTract" is really too often too brief a note, usually just a reminder of what the overall idea is. And the text itself IS internal notes, as you have said.
He says that it is a good reminder that he should remember to convert "EGS" to normal prose before publishing. He does not always have the energy or time or enthusiasm to do it. ...
Good to know, thank you. I think I’ll just ditch the “separate claims/arguments into lines” effort.
Forrest also just wrote me: “In regards to the line formatting, I am thinking we can, and maybe should (?) convert to simple conventional wrapping mode? I am wondering if the phrase breaks are more trouble than they are worth, when presenting in more conventional contexts like LW, AF, etc. It feels too weird to me, given the already high weirdness level I cannot help but carry.”
A paper that describes a risk-assessment monoculture in evaluating extinction risks: Democratising Risk.
...Many ordinary people in Western countries do and will have [investments in AI/robots] (if only for retirement purposes), and will therefore receive a fraction of the net output from the robots.
... Of course, many people today don't have such investments. But under our existing arrangements, whoever does own the robots will receive the profits and be taxed. Those taxes can either fund consumption directly (a citizen's dividend, dole, or suchlike) or (better I think) be used to buy capital investments in the robots - such purchases could be distributed
Appreciating your honesty, genuinely!
Always happy to chat further about the substantive arguments. I was initially skeptical of Forrest’s “AGI-alignment is impossible” claim. But after probing and digging into this question intensely over the last year, I could not find anything unsound (in terms of premises) or invalid (in terms of logic) about his core arguments.
Responding below:
See reasons to shift your prior: https://www.lesswrong.com/posts/Qp6oetspnGpSpRRs4/list-3-why-not-to-assume-on-prior-that-agi-alignment
Again no reasons given for the belief that AGI alignment is “progressing” or would have a “fair shot” of solving “the problem” if as well resourced as capabilities resea
Let me also copy over Forrest’s (my collaborator) notes here:
> people who believe false premises tend to take bad actions.
Argument 3:
- 1; That AGI can very easily be hyped so that even smart people can be made to falsely/incorrectly believe that there "might be" _any_chance_at_all_ that AGI will "bring vastly positive changes".
  - ie, strongly motivated marketing will always be stronger than truth, especially when VC investors can be made to think (falsely) that they could maybe get a 10000X return on investment.
- that the nature
... That’s clarifying. I agree that immediately trying to impose costly/controversial laws would be bad.
What I am personally thinking about first here is “actually trying to clarify the concerns and find consensus with other movements concerned about AI developments” (which by itself does not involve immediate radical law reforms).
We first need to have a basis of common understanding from which legislation can be drawn.
I think there are a bunch of relevant but subtle differences in terms of how we are thinking about this. My beliefs after quite a lot of thinking are:
A. Most people don’t care about the tech singularity. People are captured by the AI hype cycles though, especially people who work under the tech elite. The general public is much more wary overall of current uses of AI, and is starting to notice the harms in their daily lives (eg. addictive social media that reinforces ideologies and distorted self-images, exploitative work gigs handed to them by algorithms).
B....
Good to read your thoughts.
I would agree that slowing down further AI capability generalisation developments by more than half in the coming years is highly improbable. Got to work with what we have.
My mental model of the situation is different.
People engage in positively reinforcing dynamics around social prestige and market profit, even if what they are doing is net bad for what they care about over the long run.
People are mostly egocentric, and have difficulty connecting and relating, particularly in the current individualistic social signalling and
This is insightful for me, thank you!
Also, I stand corrected then on my earlier comment that privacy and digital ownership advocates would/should care about models being trained on their own/person-tracking data, such as to restrict the scaling of models. I’m guessing I was not tracking well what people in at least the civil rights spaces Koen moves around in are thinking and would advocate for.
re: Leaders of movements being skeptical of the notion of AGI.
Reflecting more, my impression is that Timnit Gebru is skeptical about the sci-fi descriptions of AGI, and even more so about the social motives of people working on developing (safe) AGI. She does not say that AGI is an impossible concept or not actually a risk. She seems to question the overlapping groups of white male geeks who have been diverting efforts away from other societal issues, to both promoting AGI development and warning of AGI x-risks.
Regarding Jaron Lanier, yes, (re)readi...
Returning to the error correction point:
Feel free to still clarify the other reasons why the changes in learning would be stable in preserving “good properties”. Then I will take that starting point to try to explain why the mutually reinforcing dynamics of instrumental convergence and substrate-needs convergence override that stability.
Fundamentally though, we'll still be discussing the application limits of error correction methods.
Three ways to explain why:
Forrest Landry.
...Here is how he described himself before:
> What is your background?
> How is it relevant to the work you are planning to do?

Years ago, we started with a strong focus on civilization design and mitigating x-risk. These are topics that need and require more generalist capabilities, in many fields, not just single specialist capabilities in any one single field of study or application. Hence, as generalists, we are not specifically persons who are c
Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures.
Exactly this kind of thinking is what I am concerned about. It implicitly assumes that you have a (sufficiently) comprehensive and sound understanding of the ways humans would get killed at a given level of capability, and therefore can rely on that understanding to conclude that capabilities of AIs can be greatly increased without humans getting killed.
How do you think capability developers would...
If mechanistic interpretability methods cannot prevent the interactions of AGI from necessarily converging on total human extinction (a convergence beyond theoretical limits of controllability), then these (or other "inspect internals") methods cannot contribute to long-term AGI safety. And this is not idle speculation, nor is it based on prima facie arguments. It is based on 15 years of research by a polymath working outside this community.
In that sense, it would not really matter that mechanistic interpretability can do an okay job at detecting that a power-seeking A...
No, it's not like that.
It's saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it's still lethal. Focussing on some ways that you feel confident you might be able to prevent the doomsday device from being lethal is, IMO, dangerously distracting from the point, which is that people should not build the doomsday device in the first place.
I intend to respond to the rest tomorrow.
Some of your interpretations of writings by Timnit Gebru and Glen Weyl seem fair to me (though I would need to ask them to confirm). I have not looked much into Jaron Lanier’s writings on AGI, so that prompts me to google them.
Perhaps you can clarify the other reasons why the changes in learning would be stable in preserving “good properties”? I’ll respond to your nuances regarding how to interpret your long-term-evaluating error correcting code after that.
It's prima facie absurd to think that, e.g., using interpretability tools to discover that AI models were plotting to overthrow humanity would not help to avert that risk.
I addressed claims of similar forms at least three times already on separate occasions (including in the post itself).
Suggest reading this: https://www.lesswrong.com/posts/bkjoHFKjRJhYMebXr/the-limited-upside-of-interpretability?commentId=wbWQaWJfXe7RzSCCE
“The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignment of ...
This is like saying there's no value to learning about and stopping a nuclear attack from killing you because you might get absolutely no benefit from not being killed then, and being tipped off about a threat trying to kill you, because later the opponent might kill you with nanotechnology before you can prevent it.
Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures. And as I said actually being able to show a threat to skeptics is immensely better for all solutions, including relinquishment, than controversial speculation.
A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model.
I can imagine there being movements that fit this description, in which case I would not focus on talking with them or talking about them.
But I have not been in touch with any movements matching this description. Perhaps you could share specific examples ...
As requested by Remmelt I'll make some comments on the track record of privacy advocates, and their relevance to alignment.
I did some active privacy advocacy in the context of the early Internet in the 1990s, and have been following the field ever since. Overall, my assessment is that the privacy advocacy/digital civil rights community has had both failures and successes. It has not succeeded (yet) in its aim to stop large companies and governments from having all your data. On the other hand, it has been more successful in its policy advocacy towards ...
I agree that some specific leaders you cite have expressed distaste for model scaling, but it seems not to be a core concern. In a choice between more politically feasible measures that target concerns they believe are real vs concerns they believe are imaginary and bad, I don't think you get the latter. And I think arguments based on those concerns get traction on measures addressing the concerns, but less so on secondary wishlist items of leaders.
I think that's the reason privacy advocacy in legislation and the like hasn't focused on banning computers i...
There are plenty of movements out there (ethics & inclusion, digital democracy, privacy, etc.) who are against current directions of AI developments, and they don’t need the AGI risk argument to be convinced that current corporate scale-up of AI models is harmful.
Working with them, redirecting AI developments away from more power-consolidating/general AI may not be that much harder than investing in supposedly “risk-mitigating” safety research.
Do you think there is a large risk of AI systems killing or subjugating humanity autonomously related to scale-up of AI models?
A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model. It also seems like it wouldn't pursue measures targeted at the kind of disaster it denies, and might actively discourage them (this sometimes happ...
In any case, this does not rule out that there might be computationally cheap to extract facts about the AI that let us make important coarse-grained predictions (such as "Is it going to kill us all?"... trying to scam people we're simulating for money). I think this is an unrealistically optimistic picture, but I don't see how it's ruled out specifically by the arguments in this post.
This conclusion has the appearance of being reasonable, while skipping over crucial reasoning steps. I'm going to be honest here.
The fact that mechanistic interpretabili...
This post argues that mechanistic interpretability's scope of application is too limited. Your comment describes two misalignment examples that are (maybe) within mechanistic interpretability's scope of application.
Therefore, this post (and Limited Upside of Interpretability) applies to your comment – by showing the limits of where the comment's premises apply – and not the other way around.
To be more specific:
You gave two examples for the commonly brought up cases of intentional direct lethality and explicitly rendered deception: "is it (intending to...
I read your comment before. My post applies to your comment (coarse-grained predictions based on internal inspection are insufficient).
EDIT: Just responded: https://www.lesswrong.com/posts/bkjoHFKjRJhYMebXr/the-limited-upside-of-interpretability?commentId=wbWQaWJfXe7RzSCCE Thanks for bringing it to my attention again.
I got some great constructive feedback from Linda Linsefors (which she gave me permission to share).
On the summary: Linda thinks it is not a good summary. In short, she thinks it highlights some of the weakest parts of the paper, and undersells the most important parts (eg. the survey of impossibility arguments from other academic fields).
Also, that there is too much coverage of generic arguments about AI Safety in the summary. Those arguments make sense in the original post, given the expected audience. But those comments do not make sens...
I very much agree with your arguments here for re-focussing public explanations around not developing ‘uncontrollable AI’.
Two other reasons why to switch framing:
For control/robotics engineers and software programmers, I can imagine ‘AGI’ is often a far-fetched idea that has no grounding in concrete gears-level principles of engineering and programming. But ‘uncontrollable’ (or unexplainable, or unpredictable) AI is something I imagine many non-ML engineers and programmers in industry feel intuitively firmly against. Like, you do not want your softwar
I'm just saying that if any AI with external access would be considered dangerous
I'm saying that general-purpose ML architectures would develop especially dangerous capabilities by being trained in high-fidelity and high-bandwidth input-output interactions with the real outside world.
A specific cruxy statement that I disagree on:
...An AI that is connected to the internet and has access to many gadgets and points of contact can better manipulate the world and thus do dangerous things more easily. However, if an AI would be considered dangerous if it had access to some or all of these things, it should also be considered dangerous without it, because giving such a system access to the outside world, either accidentally or on purpose, could cause a catastrophe without further changing the system itself. Dynamite is considered dangerous even
Relatedly, a sense I got from reading posts and comments on LessWrong:
A common reason why a researcher ends up Strawmanning another researcher's arguments, or publicly claims that another researcher is Strawmanning someone, is that they never spend the time and attention (because they're busy/overwhelmed?) to carefully read through the sentences written by the other researcher and consider how those sentences might be ambiguous in meaning and open to multiple interpretations.
When your interpretation of what (implicit) claim the author is ar...
Sure, I appreciate the open question!
That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.
Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.
This is because what we are dealing with is machinery that continue...