Comments

If we can steer an AI to the extent that it will follow an arbitrary rule we provide, then the tools we used to make it do that should also let us fully align AIs.


I think my point lowers the bar to there merely being a non-trivial probability of the AI following the rule. Fully aligning AIs to near certainty may be a higher bar than just potentially aligning them.
 

The key word that confuses me here is "align". How exactly does one properly align an AI? How does a human being align a GPT-2 model, for example? What does "align" even mean here?


Align it with arbitrary values, without the possibility of inner deception. If it is easy to verify an agent's values to near certainty, it seems to follow that we can more or less bootstrap alignment, with weaker agents inductively aligning stronger agents.

It was a relatively fringe topic that only recently got the attention of a large number of real researchers. And parts of it may require amounts of computational power afforded only by superhuman narrow AI.

There have been a few scattered PhD dissertations saying the topic is hard, but as far as I can tell there has only recently been a push for a group effort by capable and well-funded actors (e.g. OpenAI's interpretability research).

I don't trust older alignment research much, as an outsider. It seems to me that Yud has built a cult of personality around AI doom and is therefore motivated to find reasons why alignment is impossible. Most of his followers treat his initial ideas as axiomatic principles and don't dare to challenge them. And lastly, most past alignment research seems to have been produced by those followers.

Unfortunately, we do not have the luxury of experimenting with dangerous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.

For example, this is an argument that has been disputed with varying degrees of persuasiveness (warning shots, the incomputability of most dangerous plans), yet it is still treated as a fundamental truth on this site.

Could this translate to agents having difficulty predicting other agents' values and reactions, making it less likely that multi-agent systems act as one?

And, sure, but it's not clear why any of this matters. What is the thing that we're going to (attempt to) do with AI, if not use it to solve real-world problems?

It matters because the original poster isn't saying we won't use it to solve real-world problems, but rather that real-world constraints (e.g. the laws of physics) will limit its speed of advancement.

An AI likely cannot easily predict a chaotic system unless it can simulate reality at high fidelity. I guess the OP is assuming the TAI won't have this capability, so even if we do solve real-world problems with AI, it is still limited by real-world experimentation requirements.
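A minimal, hypothetical illustration of that point (my own sketch, not from the original discussion): in a chaotic system like the logistic map, two trajectories that start a hair apart diverge completely within a few dozen steps, so useful prediction demands near-perfect knowledge of the initial state and high-fidelity simulation.

```python
def logistic_map(x, r=4.0):
    """One step of the logistic map, a textbook example of chaos."""
    return r * x * (1.0 - x)

# The predictor's estimate of the initial state is off by only 1e-6.
x_true, x_model = 0.400000, 0.400001
for step in range(1, 51):
    x_true = logistic_map(x_true)
    x_model = logistic_map(x_model)
    if step % 10 == 0:
        print(f"step {step:2d}: true={x_true:.6f}  model={x_model:.6f}  "
              f"error={abs(x_true - x_model):.6f}")
```

By the last few printed steps the "model" trajectory bears no resemblance to the true one, despite the tiny initial error.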

"…I'd better predict words about the appropriate amount of green-coloured objects", and write about green-coloured objects even more frequently, and then also notice that, and in the end, write exclusively about green objects.

Can you explain this logic to me? Why would it write more and more about green-coloured objects even if its training data was biased towards green-coloured objects? If there is a bad trend in its output, why would it strengthen that trend without reinforcement? Do you mean it incorrectly recognizes leaning into said bad trend as good because it works in the short term but not in the long term?
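One mechanism that could produce this (a toy sketch of my own, not a claim from the post): if the model estimates "how much green is appropriate" from its own recent outputs and then slightly overshoots, the frequency compounds round after round even without any reinforcement signal. The `overshoot` factor below is purely a hypothetical assumption.

```python
green_freq = 0.55   # starts slightly biased by the training data
overshoot = 1.10    # hypothetical "notice the trend and lean into it" factor

for round_num in range(1, 11):
    # Condition on its own recent outputs, then overshoot the observed frequency.
    green_freq = min(1.0, green_freq * overshoot)
    print(f"round {round_num:2d}: fraction of green-object text ≈ {green_freq:.2f}")
```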

Could we not align the AI to recognize that there should be limits on such trends? What if a gradual misalignment gets noticed by the aligner and is corrected? The only way this evades some sort of continuous alignment system is if it fails catastrophically before that system detects it.

Consider it inductively: we start off with an aligned model that can improve itself. The aligned model, if properly aligned, will make sure its new version is also properly aligned, and won't improve itself if it can't ensure that.
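A minimal sketch of that induction, under the strong and unproven assumption that a reliable values-verification check exists; the `Model` class, `improve`, and `verify_aligned` below are purely illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class Model:
    capability: float

    def improve(self) -> "Model":
        # Placeholder self-improvement step.
        return Model(capability=self.capability * 1.5)

def verify_aligned(model: Model) -> bool:
    # Placeholder for the values-verification procedure the argument assumes;
    # it returns True here purely so the sketch runs.
    return True

def self_improvement_loop(model: Model, generations: int) -> Model:
    assert verify_aligned(model), "base case: start from a verified-aligned model"
    for _ in range(generations):
        candidate = model.improve()
        if not verify_aligned(candidate):   # inductive step: verify before handing off
            break                           # refuse to deploy an unverified successor
        model = candidate
    return model

print(self_improvement_loop(Model(capability=1.0), generations=5))
```

The structure makes the dependency explicit: alignment is preserved generation to generation only if the verification step actually works, which is the whole crux.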