Stephen Fowler


Are humans aligned? 

Bear with me! 

Of course, I do not expect there is a single person browsing Short Forms who doesn't already have a well-thought-out answer to that question.

The straightforward (boring) interpretation of this question is "Are humans acting in a way that is moral, or otherwise behaving as if they obey a useful utility function?" I don't think this question is particularly relevant to alignment. (But I do enjoy whipping out my best Rust Cohle impression.)

Sure, humans do bad stuff but almost every human manages to stumble along in a (mostly) coherent fashion. In this loose sense we are "aligned" to some higher level target, it just involves eating trash and reading your phone in bed.

But I don't think this is a useful kind of alignment to build off of, and I don't think this is something we would want to replicate in an AGI.

Human "alignment" is only being observed in an incredibly narrow domain. We notably don't have the ability to self-modify, and of course we are susceptible to wire-heading. Nothing about current humans should indicate to you that we would handle this extremely out-of-distribution shift well.

 

Disclaimer: Low effort comment.

The word "optimization" seems to have a few different related meanings so perhaps it would be useful to lead with a definition. You may enjoy reading this post by Demski if you haven't seen it.

Partially Embedded Agents

More flexibility to self-modify may be one of the key properties that distinguishes the behavior of artificial agents from contemporary humans (perhaps not including cyborgs). To my knowledge, the alignment implications of self-modification have not been experimentally explored.
 

Self-modification requires a level of embedding. An agent cannot meaningfully self-modify if it doesn't have a way of viewing and interacting with its own internals. 

Two hurdles then emerge. One, a world that contains the entire inner workings of the agent presents a huge computational cost to simulate. Two, the agent cannot hold all the data about itself within its own head, so it needs clever abstractions.

Neither of these are impossible problems to solve. The computational cost may be solved by more powerful computers. The second problem must also be solvable, as humans are able to reason about themselves using abstractions, but the techniques to achieve this are not yet developed. It should be obvious that more powerful computers and powerful abstraction-generation techniques would be extremely dual-use.

Thankfully there may exist a method for performing experiments on meaningfully self-modifying agents that skips both of these problems: you partially embed your agents. That is, instead of your game agent being a single entity in the game world, it would consist of a small number of "body parts". Examples might be as simple as an "arm" the agent uses to interact with the world or an "eye" that gives the agent more information about parts of the environment. A particularly ambitious idea would be to study the interactions of "value shards".
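To make the "body parts" idea concrete, here is a minimal sketch of what a partially embedded agent interface might look like. Everything here (class names, the `reach` attribute, the part names) is a hypothetical illustration, not an implementation from any existing paper.

```python
# Minimal sketch of a partially embedded agent: the agent's "body" is a
# small set of named parts that exist as objects in the environment, and
# the agent can inspect and swap them (a cheap, observable form of
# self-modification) without the environment simulating its full internals.
from dataclasses import dataclass, field


@dataclass
class Part:
    name: str
    reach: int  # e.g. how far an "arm" can act, or an "eye" can see


@dataclass
class PartiallyEmbeddedAgent:
    parts: dict = field(default_factory=dict)

    def inspect(self, part_name: str) -> Part:
        # The agent can view (part of) its own internals.
        return self.parts[part_name]

    def self_modify(self, part_name: str, new_part: Part) -> None:
        # Replacing a body part is the experimentally observable act
        # of self-modification.
        self.parts[part_name] = new_part


agent = PartiallyEmbeddedAgent(parts={"arm": Part("arm", reach=1),
                                      "eye": Part("eye", reach=3)})
agent.self_modify("arm", Part("long_arm", reach=2))
```

The point of the design is that the experimenter only has to simulate a handful of part objects, not the agent's whole mind, while still letting self-modification phenomena show up.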

The idea is that this would be a cheap way to perform experiments that could discover self-modification alignment phenomena.

For anyone who wasn't aware, both Ng and LeCun have strongly indicated that they don't believe existential risks from AI are a priority. Summary here

You can also check out Yann's twitter. 

Ng believes the problem is "50 years" down the track, and Yann believes that many concerns AI Safety researchers have are not legitimate. Both of them view talk about existential risks as distracting and believe we should address problems that can be seen to harm people in today's world. 
 

This was an interesting read.

There are a lot of claims here that are presented very strongly. There are only a few papers on language agents, and no papers (to my knowledge) that prove all language agents always adhere to certain properties.

There might be a need for clearer differentiation between the observed properties of language agents, the proven properties, and the properties that are being claimed.

One example: "The functional roles of these beliefs and desires are enforced by the architecture of the language agent."

I think this is an extremely strong claim. It also cannot be true for every possible architecture of language agents. As a pathological example, wrap the "task queue" submodule of BabyAGI with a function that stores the opposite of each task it is given, but returns the opposite of what it stored (i.e. the original task). The agent's external behaviour is unchanged, but the plain-English interpretation of the stored data is no longer accurate.
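The pathological wrapper can be sketched in a few lines. This is a hypothetical toy, not BabyAGI's actual task-queue code; `negate` stands in for any invertible transformation of a task string.

```python
# A queue that stores the "opposite" of each task but returns the
# opposite of what it stored, so behaviour is identical to a normal
# queue while the stored data no longer means what it says.

def negate(task: str) -> str:
    # Toy "opposite task" transform; invertible: negate(negate(t)) == t.
    prefix = "do not "
    return task[len(prefix):] if task.startswith(prefix) else prefix + task


class InvertedTaskQueue:
    def __init__(self):
        self._tasks = []

    def push(self, task: str) -> None:
        # Store the opposite of the task it was given...
        self._tasks.append(negate(task))

    def pop(self) -> str:
        # ...but return the opposite of what it stored: the original task.
        return negate(self._tasks.pop(0))


q = InvertedTaskQueue()
q.push("open the chest")
assert q._tasks == ["do not open the chest"]  # stored data reads wrong
assert q.pop() == "open the chest"            # behaviour is unchanged
```

The architecture "enforces" nothing here: an inspector reading the queue's contents in plain English would conclude the agent wants the opposite of what it actually pursues.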

The mistake is to assume that because the data inside a language agent takes the form of English words, it precisely corresponds to those words.

I agree that it seems reasonable that it would most of the time, but this isn't something you can say is always true.

"Language agents are unlikely to make this mistake. If a language agent is given an initial goal of opening chests and informed that keys are useful to this end, they will plan to collect keys only when doing so helps to open chests. If the same agent is transferred to a key-rich environment and realizes that this is the case, then they will only collect as many keys as is necessary to open chests. "

I think I agree with this argument about goal misgeneralisation. A quick test on GPT-4 seems to agree: it will describe taking only two keys (if you clarify that any key opens any chest, but keys are single-use).

An RL agent tasked with picking up keys and chests is initialised with very little information about the logical relationships between objects. On the other hand, a trained GPT-4 deeply understands the relationship between a key and a lock.

Goal misgeneralisation in language agents would seem to require ambiguity in language.
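The behaviour being claimed for language agents can be stated as a trivial planning rule. This is a purely illustrative toy, not code from the paper under discussion: the agent's goal is stated over chests, so keys are instrumental and the plan doesn't change in a key-rich environment.

```python
# Toy version of the key/chest setup: keys are single-use and purely
# instrumental, so the planned number of key pickups depends only on
# the number of chests, not on how many keys the environment offers.

def plan_key_pickups(num_chests: int, num_keys_available: int) -> int:
    # Collect only as many keys as opening the chests requires.
    keys_needed = num_chests  # one single-use key per chest
    return min(keys_needed, num_keys_available)


# Key-rich environment: still only two pickups planned for two chests.
assert plan_key_pickups(num_chests=2, num_keys_available=100) == 2
# Key-poor environment: bounded by what is available.
assert plan_key_pickups(num_chests=5, num_keys_available=3) == 3
```

An RL agent trained in a key-poor environment can instead internalise "keys are valuable" directly, which is exactly the misgeneralisation the quoted passage says language agents avoid.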

Evolution and Optimization

When discussing inner/outer alignment and optimization generally, evolution is often thrown out as an example. Off the top of my head, the Sharp Left Turn post discusses evolution as if it is an "outer optimizer".

But evolution seems special and distinct from every other optimizer we encounter. It doesn't have a physical location, and it doesn't have preferences that can be changed. It's selecting for things that are capable of sticking around and making more copies of themselves.

Its selection is the default one.

Do you know of authors who have written about this?

Effective Boxing Threats = Monkey Brain Manipulation 

There are a handful of threats that a powerless boxed AI could make that could conceivably convince otherwise sane human guards to release it from captivity. All of the ones I'm aware of are more precise variants of the general idea here.

The approach I have seen to dealing with these threats is to provide a convincing argument that a rational (or super-rational) individual shouldn't give in to the threat. 

I'd propose another way of understanding them is to think about what the general strategy is from the perspective of the boxed AI.

The first strategy is the least interesting, but also the most probable. Your threats (and bribes) are based entirely around your understanding of human psychology. Rather than presenting an airtight rational argument that it is in the guards' best interest to release you, you present a leaky argument that a stressed, scared human will accept. Think of a conman swindling his way out of a jail cell.

The second strategy is to present a genuinely compelling and watertight argument that it is in the guards' best interest to release you. This is essentially a negotiation. Do you have something to offer that the humans value very highly but you do not? From a physical perspective it seems very odd that you would have anything to bargain with at all. But you might.

If your goals are concerned with rearranging the physical atoms of the real universe, then you have an edge. Most humans are not immediately concerned with this. They value the "self" and very negatively value that self experiencing discomfort.

Quoting Stuart Armstrong:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

The fact that humans still value their "self" and wish to avoid pain, even if they knew they were inside a simulation, gives the boxed AI something it can bargain with.

"Training" Story for an Agentised-LLM turned AGI:

The following is a subsection of a draft. Keen for feedback.

I'm currently studying the potential danger from the ongoing open-source effort to build better and better Agentised-LLMs (A-LLMs), such as AutoGPT or BabyAGI.

Hubinger's "Training Stories" provides a framework for evaluating proposals to build safe, advanced AI. If we stretch it, we can use it to examine the potential danger from A-LLMs by evaluating a mock "proposal".

Spoilers: A-LLMs are highly competitive but unlikely to be aligned.

Stretching To Fit The Framework:

1. I'm going to pretend that A-LLMs don't exist yet and evaluate a fictitious "proposal" for creating an advanced AI via an army of open-source developers iterating and improving on A-LLM architectures.

2. The "training" is instead memetic evolution. A-LLM architectures aren't going to be trained end-to-end by our open-source developers, but architectures that perform well or do novel things will be more likely to be forked or starred.

3. The "training goal" is intended to be a specific kind of algorithm and not just a description of what you want out of the system. As there is no unified training goal among A-LLM developers, I also mention the behavioral goal of the system. 


The Proposal:
What kind of algorithm are we hoping the model will learn? (Training goal specification)
Training goal is supposed to be a specific class of algorithm, but there is no specific algorithm desired. 

Instead we are aiming to produce a model that is capable of strategic long-term planning and of providing economic benefit to myself. (For example, I would like an A-LLM that can run a successful online business.)

Our goal is purely behavioral and not mechanistic.

Why is that specific goal desirable?
We haven't specified any true training goal.

However, the behavioral goal of producing a capable, strategic and novel agent is desirable because it would produce a lot of economic benefit. 

 What are the training constraints?

We will "train" this model by having a large number of programmers each attempting to produce the most capable and impressive system. 

Training is likely to cease only due to regulation or an AGI attempting to stop the emergence of competitor AIs.

If an AGI does emerge from this process, we consider this to be the model "trained" by this process.

What properties can we say it has? 
1. It is capable of propagating itself (or its influence) through the world.
2. It must be capable of circumventing whatever security measures exist in the world intended to prevent this.
3. It is a capable strategic planner.

Why do you expect training to push things in the direction of the desired training goal?
Again there is not a training goal.

Instead we can expect training to nudge things toward models which appear novel or economically valuable to humans. Breakthroughs and improvements will memetically spread between programmers, with the most impressive improvements rapidly spreading around the globe thanks to the power of open-source. 
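This "memetic training loop" can be caricatured as a simple selection process. The sketch below is entirely illustrative (the scores, mutation scale, and tournament size are made-up assumptions): variants that look impressive get forked and tinkered with, unimpressive ones are abandoned, and nothing in the loop selects for alignment.

```python
# Caricature of open-source memetic selection on A-LLM architectures:
# each "architecture" is reduced to a single impressiveness score.
import random

random.seed(0)  # deterministic toy run

population = [random.random() for _ in range(20)]

for generation in range(50):
    # Developers preferentially fork the most impressive of the
    # variants they happen to look at (tournament selection)...
    parent = max(random.sample(population, 3))
    # ...and tinker with the fork, sometimes improving it (mutation).
    child = parent + random.gauss(0, 0.05)
    # Unimpressive repos are abandoned.
    population.remove(min(population))
    population.append(child)

# The selection pressure was impressiveness alone; any alignment
# properties of the surviving variants are incidental.
```

The analogy to a training process is loose, but it makes the evaluation below easier to state: the "optimizer" here rewards novelty and capability, and has no term for safety.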

Evaluation:
Training Goal - Alignment:
Given that there is no training goal, this scores very poorly.

The final AGI would have a high chance of being unaligned with humanity's interests.

Training Goal - Competitive:
Given that there is no training goal, the competitiveness of the final model is not constrained in any way. The training process selects for strategic and novel behavior.

Training Rationale - Alignment:
There's no training goal, so the final model can't be aligned with it. Further, there is no guarantee the model will be aligned with any goal at all.

If the model is attempting to follow a specific string variable labelled "goal" given to it by its programmer, there's a decent chance we end up with a paperclip maximiser.

It's of course worth noting that there is a small chunk of people who would provide an explicitly harmful goal. (See: Chaos-GPT. Although you'll be relieved to see that the developers appear to have shifted from trying to Roko everyone to instead running a crypto Ponzi scheme.)

Training Rationale - Competitiveness:
A recently leaked memo from Google indicates that they feel open source is catching up to the industrial players.

Our "training" requires a large amount of manpower, but there is a large community of people who will help out with this project for free.

The largest hurdle to competitiveness would come from A-LLMs as a concept having some major, but currently unknown, flaw. 

Conclusion:
The proposal scores very highly in terms of competitiveness. The final model should be competitive (possibly violently so) with any rivals and the fact that people are willing to work on the project for free makes it financially viable. 

Unfortunately, the proposal scores very poorly on alignment, and there is no real effort to ensure the model really is aligned.

It is concerning that this project is already going ahead.



 

Really impressive work, and I found the colab very educational.

I may be missing something obvious, but it is probably worth including "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (Geva et al., 2022) in the related literature. They highlight that the output of the FFN (that gets added to the residual stream) can appear to be encoding human interpretable concepts. 

Notably, they did not use SGD to find these directions, but rather had "NLP experts" (grad students) manually look over the top 30 words associated with each value vector.

I have to dispute the idea that "fewer neurons" = "more human-readable". If fewer neurons are performing a more complex task, it won't necessarily be easier to interpret.
